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Abstract 


The linking genotype to phenotype is the fundamental aim of modern genetics. We focus 
on study of links between gene expression data and phenotype data through integrative 
analysis. In this work, we propose three approaches. 

1) The inherent complexity of phenotypes makes high-throughput phenotype profil¬ 
ing a very difficult and laborious process. We propose a method of automated multi¬ 
dimensional profiling which uses gene expression similarity. Large-scale analysis of more 
than 500 gene expression datasets show that our method can provide robust profiling that 
reveals different phenotypic aspects of samples. This profiling technique is also capable 
of interpolation and extrapolation beyond the phenotype information given in training 
data. As such, it can be used in many applications, including facilitating experimental 
design and detecting confounding factors. 

2) Phenotype association analysis problems are complicated by small sample size and 
high dimensionality. Consequently, phenotype-associated gene subsets obtained from 
training data are very sensitive to selection of training samples, and the constructed sam¬ 
ple phenotype classifiers tend to have poor generalization properties. To eliminate these 
obstacles, we propose a novel approach that generates sequences of increasingly discrim¬ 
inative gene cluster combinations. Our experiments on both simulated and real datasets 
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show that the resulting cluster combinations are robust to perturbation in training sam¬ 


ples and have good sample phenotype classihcation performance. 

3) Many complex phenotypes, such as cancer, are the product of not only gene expres¬ 
sion, but also gene interaction. We propose an integrative approach to find gene network 
modules that activate under different phenotype conditions and apply this approach to 
the study of cancer. Using our method on multiple cancer gene expression datasets, we 
discovered cancer subtype-specific network modules, as well as the ways in which these 
modules coordinate. In particular, we detected a breast-cancer specific tumor suppressor 
network module with a hub gene, PDGFRL, which may play an important role in this 
module. 
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Chapter 1 


Introduction 

1.1 Integrative analysis of gene expression and phenotype 
association 

With the advancement of the high-throughput genotyping technologies, large amounts of 
genotype data have been accumulated in recent years. Such data include gene expres¬ 
sion, single nucleotide polymorphism, copy number variation, DNA mythelation, histone 
acetylation, alternative splicing, and proteomics. Since the fundamental problem in mor- 
den genetics is the linking of genotype to phenotype, we are particularly interested in 
the associations between gene expression and phenotype in this dissertation. 

1.2 Gene expression and phenotype data 

With the completion of the Human Genome Project, biology research is entering the post 
genome era. Although biologists have collected a vast amount of DNA sequence data, 
the details that would explain how these sequences function still remain largely unknown. 
Genomes of even the simplest organisms are very complex. Key questions still confront 
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biologists, including 1) the functional roles of different genes and the cellular processes 
in which they participate; 2) how genes are regulated, how genes and gene products 
interact, and how to identify interaction networks; and, finally, 3) how gene expression 
levels differ in various cell types and states, as well as how gene expression is changed by 
various disease pathologies and their treatments. Biology used to be a data-poor science. 
Nowadays, however, many of these questions can be addressed with the advent of more 
advanced techniques, such as gene expression. Therefore, investigators are now able to 
transform vast amounts of biological information into useful data. 

This makes it possible to study gene function globally, and a new field, functional 
genomics, emerges. Specifically, functional genomics refers to the development and ap¬ 
plication of global (genome-wide or system-wide) experimental approaches to assess gene 
function by making use of the information and reagents provided by structural genomics. 
Gene function may be divided into gene expression, which refers to the process of tran¬ 
scribing a gene’s DNA sequence into the RNA that serves as a template for protein 
production, and the level of gene expression, which indicates how active a gene is in cer¬ 
tain tissues, at certain times, or under certain experimental conditions. Therefore, the 
monitoring of gene function can provide an overall picture of the genes being studied, in¬ 
cluding their expression level and the activities of the corresponding protein under certain 
conditions. To accomplish this, functional genomics provides techniques characterized by 
high-throughput or large-scale experimental methodologies combined with statistical and 
computational analysis of the results. 

Among the several methods that have been developed to understand the behavior 
of genes, microarray technology is particularly important. It is used to simultaneously 
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monitor the expression levels of genes by several orders of magnitude, and many steps 
are involved in this technology. First, complementary DNA (cDNA) molecules or oligos 
are printed onto slides as spots. Then, two kinds of dye-labeled samples, i.e., sample and 
control, are hybridized. Finally, the hybridization is scanned and stored as images (Figure 
1.1). Using a suitable image processing algorithm, these images are then quantified into a 
set of expression values representing the intensity of spots. Usually, the dye intensity may 
be biased by factors, such as physical property, experimental variability in probe coupling 
and processing procedures, and scanner settings. To minimize the undesirable effects 
caused by this biased dye intensity, normalization is done to balance dye intensities and 
make expression value comparable across experiments. Here the term comparable means 
that the difference of any measured expression value of a gene between two experiments 
should reflect the difference of its true expression levels. 

A phenotype is any observable characteristic of an organism. Phenotypes result from 
the activities of an organism’s genes as well as the influence of environmental factors 
and possible interactions between the two. Phenotypes are often described in natural 
languages in the literature. They can sometimes be measured as continuous or categorical 
values. While it has thus far been very difficult to automatically extract phenotype 
information from the literature, the Unified Medical Language System (UMLS) has been 
developed in recent years to provide facilities for natural language processing. As such, 
UMLS is a compendium of many controlled vocabularies in the biomedical sciences. It 
provides a mapping structure among these vocabularies and thus gives researchers the 
ability to translate among the various terminology systems. It is also a comprehensive 
thesaurus and ontology of biomedical concepts and consists of the following components: 
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Figure 1.1 A diagram illustrating how gene expression is obtained from microarray 
experiments. Figure obtained from http://www.mun.ca 


4 









1) a Metathesaurus, the core database of the UMLS, which is a collection of concepts 


and terms from the various controlled vocabularies and their relationships; 2) a Semantic 
Network which is a set of categories and relationships that are used to classify and relate 
the entries in the Metathesaurus; and 3) a SPECIALIST Lexicon which is a database of 
lexicographic information for use in natural language processing. A number of supporting 
software tools are also provided for this system. 


1.3 Integrative analysis of gene-phenotype 
association 

In recent years, the accumulation of gene expression data has been a rapid process. For 
example, the two largest public gene expression repositories. Gene Expression Ominibus 
(GEO) and Array Express, already contain more than 500,000 samples, measuring genome 
wide gene expression levels of various organisms under various phenotypic conditions. 

Basically, these gene expression data repositories contain two types of information: 
gene expression and sample phenotype information. Expression information based on 
microarrays are measured in continuous values. On the other hand, sample phenotype 
information is often described in natural language, or it can be manually structured and 
recorded into a database. We develop methods of integrative analysis which combine 
these two types of information obtained from different experimental sources, therefore 
enhancing confidence and providing a better understanding of the association between 
genotype and phenotype. 
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1.4 Summary of our work 


As indicated above, we propose three approaches for the above analysis. 1) We propose 
the concept and a method of Automated Multi-dimensional Phenotypic Profiling by using 
genotype information as a bridge. 2) We propose a method that can robustly search for 
combination of gene sets that provides true discrimination and good prediction of sample 
phenotype classes. 3) We propose an integrative approach to characterize disease-specific 
pathways and their coordination using gene coexpression network analysis. The first work 
primarily focus on phenotype prediction. Both the second and third work above involve 
studying different aspects of gene phenotype associations. The second work focus on 
selecting phenotype associated combinations of gene sets, while the third work particularly 
focus on genetic interaction mechanism that is associated to phenotype difference. 

1.4.1 Automated multi-dimensional phenotype profiling 

Phenotypes are complex and difficult to quantify in a high-throughput fashion. The lack 
of comprehensive phenotype data can prevent or distort genotype-phenotype mapping. In 
Chapter 2 we describe PhenoProfiler, a novel computational method that enables in sili¬ 
con phenotype profiling. Drawing on the principle that similar gene expression patterns 
are likely to be associated with similar phenotype patterns, PhenoProfiler supplements 
the missing quantitative phenotype information for a given microarray dataset based on 
other well-characterized microarray datasets. 

We applied our method to 587 human microarray datasets covering more than 14,000 
samples, and confirmed that the predicted phenotype profiles are highly consistent with 
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true phenotype descriptions. PhenoProfiler offers several unique capabilities; (1) au¬ 


tomated, multi-dimensional phenotype profiling, facilitating the analysis and treatment 
design of complex diseases; (2) the extrapolation of phenotype profiles beyond provided 
classes; and (3) the detection of confounding phenotype factors that would otherwise 
bias biological inferences. Finally, as no direct comparisons are made between gene ex¬ 
pression values from different datasets, the method can utilize the entire body of cross¬ 
platform microarray data. This work has produced a compendium of novel phenotype 
profiles for the NCBI GEO datasets, which can facilitate an unbiased understanding of 
transcriptome-phenome mapping. By increasing the variaty and the quality of pheno¬ 
types to be profiled, the continued accumulation of microarray data will further increase 
the power of PhenoProfiler. 


1.4.2 Stable refinement of discriminative gene clnsters 

Gene expression data are often used to identify the phenotype associated genes that are 
differentially expressed between samples of two phenotype classes. On the other hand, 
since microarray datasets contain only a small number of samples and a large number of 
genes, it is usually desirable to identify small gene subsets with distinct patterns between 
sample phenotype classes. Such gene subsets are highly discriminative in phenotype clas¬ 
sification because of their tightly coupling features. Unfortunately, gene subsets obtained 
from training data are often very sensitive to the selection of training samples, and the 
classifiers constructed from such gene subsets usually tend to have poor generalization 
properties on the test samples as a result of the overfitting problem. 
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In Chapter 3, we propose a novel approach [XZZ08] to correct these problems by 
combining supervised and unsupervised learning techniques to generate increasingly dis¬ 
criminative gene cluster combinations in an iterative manner. Compared with existing 
approaches, our experiments on both simulated and real datasets show that our method 
can produce a series of gene cluster combinations that are robust to training sample per¬ 
turbation and have good sample phenotype classification performance. This backward 
approach for refining a series of highly discriminative gene cluster combinations for the 
purpose of sample phenotype classification has proven to be very consistent and stable 
when applied to various types of training data. In addition, these combinations can be 
further used to study the underlying genetic interactions that lead to phenotype differ¬ 
ence, and extended to integrate gene clusters obtained from multiple datasets consisting of 
similar kinds of phenotype classification studies. Thus, the confidence that these clusters 
are truly discriminative can be further enhanced. 

1.4.3 Disease-specific pathway identification 

The most common application of microarray technology in disease research is to identify 
genes differentially expressed in disease versus normal tissues. However, in the case of 
complex diseases, it is well known that, phenotypes are determined by the underlying 
structure of genetic networks, not by individual genes. Thus in many situations, it is the 
interaction of many genes that plays an important role in causing phenotypic variations. 
In Chapter 4, using cancer as an example, we develop a graph-based method [XKNI'’'08] 
to integrate multiple microarray datasets to discover disease-related co-expression network 


modules. We propose an unsupervised method that takes into account both co-expression 



dynamics and network topological information to simultaneously infer network modules 
and phenotype conditions in which they are activated or de-activated. 

Using our method, we have discovered network modules specihc to cancer or subtypes 
of cancers. Many of these modules are consistent with or supported by their functional 
annotations or their previously known involvement in cancer. In particular, we identi¬ 
fied a module that is predominately activated in breast cancer and is involved in tumor 
suppression. While individual components of this module have been suggested to be as¬ 
sociated with tumor suppression, their coordinated function has never been elucidated. 
Here by adopting a network perspective, we have identified their interrelationships and, 
particularly, a hub gene PDGFRL that may play an important role in this tumor sup¬ 
pressor network. Consequently, by using a network-based approach, our method provides 
new insights into the complex cellular mechanisms that characterize cancer and cancer 
subtypes. By incorporating co-expression dynamics information, our approach not only 
extracts more functionally homogeneous modules than those based solely on network 
topology, but also reveals pathway coordination beyond co-expression. 
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Chapter 2 


Automated Multi-dimensional Phenotypic Profiling 


2.1 Introduction 


The fundamental aim of modern genetics is linking genotype to phenotype. With the 
rapid accumulation of genomics data, the lack of phenotype data has become the bot¬ 
tleneck of this process [FS03]. Phenotyping, especially for human subjects, is a labo¬ 
rious process [LL07]. Moreover, researchers often gloss over the complexity of human 
phenotypes by reporting only those traits specifically relevant to their studies. For ex¬ 
ample, a given dataset may provide survival information but not the patients’ ages. In¬ 
ferences derived from such data could be biased or even invalidated by undocumented or 
poorly documented phenotypic traits. Furthermore, most available phenotype character¬ 
izations are qualitative (categorical) rather than quantitative (continuous). This practice 
is problematic for two reasons: the boundaries between categories are often vague or ar¬ 
bitrary [OHB08], and any phenotypic information distinguishing data within a category 
is lost. 
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In this paper, we address the above issues by developing PhenoProfiler, a computa¬ 


tional framework for predicting the quantitative phenotype information missing from a 
genomic dataset. In particular, this method associates each sample of a given dataset 
with the relative intensity of a specific phenotype trait. The quantitative measures of 
samples across the whole dataset is referred to as a phenotype profile (PP). Examples 
include the body weights of individuals, degrees of malignancy in tumor samples, and the 
quantitative responses of patients to drug treatments. 

The principle of PhenoProfiler is that similar genomic patterns are likely to be asso¬ 
ciated with similar phenotypic patterns [BvD04]. Thus, we can supplement the (incom¬ 
plete) phenotypic information in a given genomics dataset with traits recorded in other 
well-characterized datasets. In particular, we focus on the vast accumulated microarray 
data. The NCBI Gene Expression Omnibus (GEO) [BTW“*'07], for example, currently 
contains more than 2000 human microarray datasets that systematically document the 
transcriptome basis of phenotypes as diverse as heart diseases, mental illness, infectious 
diseases, and a variety of cancers. 

The intuition behind our method is as follows. Given a training dataset with known 
sample description of a phenotype P, for each gene we can derive an association between 
its expression profile and this phenotype P. We denote those genes as signature genes, the 
expressions of which are strongly associated with the phenotype in the training dataset. 
Given a new microarray dataset that is known to be related to the phenotype P, but 
the phenotype description of its individual samples are unknown, we aim to estimate the 
PP by constructing a sample profile as a real-valued vector that is most similar to the 
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expression profiles of those signature genes in the new dataset. Figure 2.1 illustrates this 


approach. 

Since the information we borrow from the training dataset is only the association 
between the gene expressions and sample phenotypes, we do not directly compare the ex¬ 
pression values between the training and prediction datasets, thus bypassing the data in¬ 
compatibility problem between cross-platform and cross-laboratory microarray datasets. 
PhenoProfiler can therefore use as many as possible microarray datasets in the public 
repositories in the training stage, and construct a new dataset’s prohles for a wide range 
of phenotypes. 

We applied our method to 587 human microarray datasets, covering more than four¬ 
teen thousand microarray samples. The predicted phenotype profiles were highly con¬ 
sistent with known phenotype descriptions. We showed that PhenoProfiler can robustly 
provide multi-dimensional characterization of the phenotypes missing from a dataset, and 
can facilitate the discovery of confounding factors for the transcriptome-phenotype map¬ 
ping. The comprehensive phenotypic data generated by this method will vastly increase 
the value of published and forthcoming genomics data. 


2.2 Results 

2.2.1 Overview of method 

As illustrated in Figure 2.1, PhenoProfiler consists of two steps. (1) Given a microarray 
dataset Di whose samples have known descriptions of the phenotype P, for each gene i 


12 



Training Dataset 



gi 

92 

93 

94 



91 

92 

93 

94 

Phenotype Profiie 

Sampies reordered according to 
Phenotype Profiie 




Figure 2.1 The principles of PhenoProfiler. Each row of an expression matrix corre¬ 
sponds to a gene, and each column corresponds to a sample. The magnitude of gene 
expression is indicated by a color scale running from green (low) to red (high). From the 
training dataset (left), we obtain for each gene i a coefficient Wi describing the degree 
of association between its expression level and the phenotype. Here, genes 1 and 2 have 
high positive coefficients. Gene 3 shows no clear association with the phenotype, so its 
coefficient is close to zero. Gene 4 has a high negative coefficient. Therefore, genes 1,2 
and 4 are signature genes of the phenotype. Given a new dataset, for each sample we 
aim to estimate the relative intensity of the association between this sample and the 
phenotype such that the derived intensity values of all sample are most similar to the ex¬ 
pression values of signature genes. We term such a sample profile as a phenotype profile. 
The phenotype profile vector is depicted by the grey cells, where brighter colors indicate 
estimated tighter association with the phenotype. After sorting the samples based on the 
phenotype profile, it can be seen that the expression profiles of genes 1 and 2 are strongly 
correlated with phenotype profile, while that of gene 4 is anti-correlated. 


13 







































we calculate a coefficient Wi that indicates the degree of association between its expres¬ 


sion profile and the phenotype P. The appropriate way to calculate Wi depends on the 
structure of the phenotype descriptions. If the phenotype description is binary, i.e. the 
dataset compares two phenotype groups, we can simply use the two-sample t-statistic as 
Wi- If the phenotype description is continuous, we can use the correlation between the 
phenotype values and the gene expression values as Wi. Genes with a large magnitude 
of Wi are termed the signature genes of the phenotype P. (2) In order to predict the 
phenotype profile (PP) of a new microarray dataset D 2 , we use the following constrained 
optimization approach. For a dataset with m individual samples, the PP is defined as a 
normalized, real-valued vector of m values {pi,p 2 , ■ ■ ■ ,Pm)'^ (denoted p). We also define 
the normalized expression vector of the gene i as {eii,ei 2 , ■ ■ ■ ,eim)'^, denoted ep, and we 
denote the expression matrix containing all such expression vectors as E. Given a set of 
coefficients Wi determined from the training data, the objective is to find a profile p that 
minimizes the weighted least-squares difference min^^ jrcil ^j(sgn(rt;j)eij -Pjf. (The 
function sgn is -|-1 or —1 depending on the sign of its argument; we want the phenotype 
profile p to be close to e* when Wi is positive, and close to —e* when Wi is negative.) A 
series of matrix computations (see the Methods section) yields the optimal solution p, 
which is essentially the normalized weighted (by Wi) sum of gene expression values across 
samples. 

To assess whether the predicted profile p captures the expression trend of those sig¬ 
nature genes, we calculate an association score c, defined as the Pearson’s correlation 
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between the coefficients w and Ep (detailed explanation in Methods). To assess the sta¬ 
tistical significance of p, we compare the c to those calculated using the same expression 
matrix E and 1000 random permutations of coefficients w. 

2.2.2 An illustrative case: reconstructing the temporal order of the 
yeast log-to-stationary growth transition 

As an illustrative example, consider the two microarray datasets GDS18 and GDS283. 
Both study the log-to-stationary growth transition of yeast, but with different microarray 
platforms. Both datasets measure gene expression starting at the logarithmic phase and 
extending through the stationary phase. Here our goal is to predict changes in yeast 
phenotype from log to stationary transition, responding to the depletion of nutrients. 
Naturally, the temporal order of the samples serves as a good means of validating the 
prediction. 

Using one dataset for training, we computed the Spearman’s rank correlation between 
individual gene expression profiles and the temporal order of the samples. These statistics 
are used as the training coefficients. In the other dataset, we then predicted the phenotype 
response profile based on gene expressions, hoping to recover the correct temporal order of 
samples. In both cases, the predicted PP was highly consistent with the actual sequence 
of samples (Spearman’s rank correlation were 0.83 and 0.79, depending on which dataset 
was used for training). Figure 2.2 shows the predicted and the original sample order of 
dataset GDS18. Two subgroups are visible in the predicted profile, accurately reflecting 
the logarithmic and stationary phases. The sole exception is at the transition between 
the two phases. 
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Figure 2.2 Consistency between predicted and true phenotype profiles in GDS18. Using 
GDS283 as the training dataset, the predicted phenotype profile of GDS18 closely matches 
the original temporal order of the samples. The original temporal order is measured as 
the logarithm (base 10) of minutes. 
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Intriguingly, experiment GDS283 stopped taking measurements only 17 hours after 


the yeast entered the stationary phase. Experiment GDS18, on the other hand, continued 
measurements for another 12 days. It is remarkable that a phenotype signature derived 
from GDS283 can accurately sort the phenotype progression of GDS18. This result 
demonstrates that the essential physiological changes occurring within and between the 
logarithmic and stationary growth phases can be extrapolated as well as interpolated. 

2.2.3 Large-scale prediction of phenotype profiles 

To test the general applicability of our method, we performed a large-scale analysis of 
587 human microarray datasets (see the Methods section for details on data collection 
and processing). Datasets containing at least two disjoint sample groups, representing 
a phenotype and its baseline {P and P'), each with at least 10 samples, were selected 
as training datasets (Di). If a dataset contains n sample groups, we can generate ( 2 ) 
distinct training datasets. Since the phenotype values in these datasets are categorical, 
we use the two-sample t-statistic as coefficients w. By setting the threshold p-value for 
the predicted PP to 0.001 and the association score c > 0.25, a total of 37,852 novel PPs 
were associated to the 587 datasets. 

To validate the method, for each training dataset Di we also need a testing dataset 
D 2 that contains the sample descriptions on exactly the same phenotypes P and P'. 
Among all 587 datasets, we only identified 4 training-testing dataset pairs meeting this 
criterion, in which each of the testing datasets also contains two sample groups of P 
and P'. To assess whether the predicted phenotype profile is consistent with the known 
distribution of phenotypes P and P' in the testing dataset, we used the Wilcoxon rank 
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sum test. Specifically, in the test dataset, the two sample groups’ {P and P') predicted 
phenotype values are compared using Wilcoxon rank sum test. A small Wilcoxon p- 
value indicates that there is a significant difference between the distributions of predicted 
phenotype values for the two groups, therefore the predicted profile is consistent with 
known phenotype information. Among the 4 training-testing pairs, all predicted PPs 
were highly consistent with the known phenotype groups (Wilcoxon p-value < 10“^). 

In order to obtain a general assessment using more validation data, we relaxed the 
requirement that the description of the testing dataset exactly matches the training data 
phenotypes. In fact, if the phenotypes of a given dataset were even moderately similar 
to the training phenotype, the predicted profiles were found to agree well with known 
phenotype groups in the testing dataset. This implies a strong interdepedence among 
related phenotypes. We quantify the similarity between the training and testing phe¬ 
notype with two measures: (1) the percentage 7 of Unified Medical Language System 
(UMLS) concepts of the merged sample group descriptions of Di shared with the dataset 
description of D 2 ; and ( 2 ) the similarity between the descriptions of corresponding sample 
groups in Di and D 2 , denoted as s. The latter is defined as the cosine of the angle be¬ 
tween two term frequency-inverse document frequency (tf-idf) vectors of mapped UMLS 
terms (See the Methods section for details). Following these measurements, we iden¬ 
tified 32 training-testing dataset pairs with similarity thresholds s > 0.4 and 7 > 0.6. 
Among these, 81% of predicted phenotype profiles were consistent with prior phenotype 
descriptions (Wilcoxion test p-value < 0.05). This result highlights the effectiveness of 
our method in exploiting the interdependence of similar phenotypes. 
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We further studied the robustness of our method against the size of the training 


dataset. We randomly selected a training dataset and a test dataset, and then calculated 
the correlation between the PP constructed with the original training dataset and that 
with certain amount of training samples randomly removed. Repeating this test 10,000 
times with 10% (and 20%) sample removal produced an average correlation of 0.98 (and 
0.95) between the resulting PPs and those without any samples removed. Even for those 
training datasets with a small size of 10 samples in each of the two phenotype groups, 
the obtained PP correlations were still greater than 0.9 for both 10% and 20% sample 
removal, demonstrating the robustness of our method. 

2.2.4 Multi-dimensional profiling of complex phenotypes 

As previously mentioned, a total of 37,852 PPs were derived and assigned to the 587 
datasets. On average, each dataset is assigned 65 PPs. In some cases, related train¬ 
ing datasets generated highly correlated PPs, further enhancing our confidence in the 
prediction. Two examples are described below. 

Dataset GDS2855 studies various forms of muscular dystrophy. Three training sets 
(GDS609, 610, and 612) generated highly correlated PPs (average correlation 0.88) for 
GDS2855. All three training datasets describe the difference between Duchenne mus¬ 
cular dystrophy and normal muscle tissues, although they were measured with different 
platform technologies. Furthermore, all 3 predicted PPs were highly consistent with the 
original sample description of GDS2855 (Wilcoxion test p-value < 10“®). 

Dataset GDS1962 studies gliomas of different grades, and was assigned four highly 
correlated PPs (average correlation 0.9) by datasets GDS1975, 1976, 1815, and 1816. 
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All four training datasets focused on comparing grade III and grade IV glioma samples. 


Remarkably, the predicted PPs not only did a good job of separating grade III from grade 
IV samples in the testing dataset, but also separated grade II from grade III samples. In 
addition, the separations followed the order of tumor grades. This example shows that our 
method captures the essential difference between high- and low-grade tumors, and thus 
can be extrapolated to tumors of grades beyond those represented in the training data. 
This ability to extrapolate from the training dataset represents a significant advantage 
over traditional classification methods. 

A testing dataset is often (78% of cases) assigned multiple uncorrelated PPs (correla¬ 
tion < O.I) describing different properties of a complex phenotype. For example, dataset 
GDS843 contains 49 samples comparing patients with abnormal karyotypes to patients 
with normal karyotypes to study adult acute myeloid leukemia (AML). The samples were 
collected from peripheral blood or bone marrow. Its predicted phenotypes include three 
uncorrelated profiles (see Figure 2.3), which are detailed below. 

1. Training dataset GDS842 also studied abnormal versus normal karyotypes in adult 
AML patients. The derived phenotype profile is consistent with known sample 
description of this phenotype in GDS843 (Wilcoxon p-value 0.04), thus validating 
our method. 

2. Training dataset GDS2118 compared individuals with refractory anemia to normal 
individuals. The PP trained by this comparison is highly correlated (correlation 
> 0.9) with two other PPs that also come from training datasets that studied re¬ 
fractory anemia. In fact, the recently proposed WHO classification of hematologic 
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Figure 2.3 Multi-dimensional profiling of the dataset GDS843 that compares patients of 
adult AML with abnormal karyotypes to those with normal karyotypes. Three different 
training datasets produced mutually uncorrelated phenotype prohles (black curves) that 
could be assigned to GDS843. The three most significantly correlated expression profiles 
of genes known to be related to the respectively profiled phenotypes are superposed in 
the three primary colors. For clarity, the samples are ordered according to the first PP, 
and the expression profiles of negatively correlated expression profiles have been reversed. 


21 












malignancies merged the disease refractory anemia with excess blasts in transfor¬ 
mation (RAEB-T) into the category AML. However, this new disease classification 
is controversial. Although RAEB-T and AML share similar clinical parameters, a 
study pointed out that their biological bases are different (e.g. RAEB-T is distin¬ 
guished from AML by a significant increase in apoptosis), and it suggested that 
RAEB-T should be regarded as a distinct disease entity [ABO'’“00]. Therefore, 
the derived PP may uncover hidden patient information and possibly help to dif¬ 
ferentiate RAEB-T from AML, which could further lead to improved treatment 
design. 

3. Training set GDS1221 studies the patient response to the drug Imatinib. Imatinib 
was designed to treat chronic myeloid leukemia by reducing the tyrosine kinase 
activity of the well-known bcr-abl fusion gene. Our phenotype profile could there¬ 
fore be used to identify patients that would be more likely to respond to Imatinib 
treatment. 

In summary, while the first PP serves as an internal validation of the method, the 
other two PPs provide novel insights into the pathologic and therapeutic properties of 
sample phenotypes in the dataset GDS843. The specific phenotype properties represented 
by the above PPs can be further confirmed by examining genes whose expressions are 
significantly correlated (p-value < 0.001) with these profiles. For example, the AML PP 
(number 1 above), has 10 significantly correlated genes that are known to be associated 
with the UMLS concept “Leukemia, Myelocytic, Acute.” A particularly interesting gene 
is FLT3. A study suggested that in patients with karyotype alterations, a recipricol 


22 



translocation was not sufficient to cause acute promyelocytic leukemia, and that an ad¬ 
ditional mutation in FTL3 may be required [DLCBPS'’“08]. For the refractory anemia 
activation profile, there are 12 significantly correlated genes known to be involved in “Ane¬ 
mia.” These include three Fanconi Anemia genes (FANCA, FANCD2, FANCG) as well 
as TGFBl, which may affect the progression of refractory anemia specifically [BBG^05]. 
Among the genes correlated with the Imatinib response profile, three are tyrosine kinases, 
which is consistent with the target of Imatinib [DD03]. This example demonstrates the 
power and specificity of our multi-dimensional phenotype profiling approach by utilizing 
the large number of microarray datasets. 

2.2.5 Discovery of hidden confounding factors in microarray studies 

Due to the scarcity of phenotype information in many microarray datasets, confounding 
phenotype variables may not be well documented. Thus, caution should be exercised in 
deriving inferences from microarray datasets. The following cases provide representative 
scenarios. 

Dataset GDS1673 examines normal lung tissue from 23 donors, including smoking and 
non-smoking individuals. Interestingly, we found that a predicted PP trained on male vs. 
female skeletal muscle samples (dataset GDS914) was able to separate the smoking and 
non-smoking samples of GDS1673 (Wilcoxon p-value 0.0002). After obtaining additional 
phenotype information on the GDS1673 subjects, it turns out that among non-smokers, 
which made up almost two-thirds of the sample, females outnumbered males by more than 
two to one, while among smokers the numbers of the two genders were approximately 
equal. Thus, simply comparing the expression profiles of the smoking versus non-smoking 
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groups would not derive the signature of smoking, but rather the mixed signatures of 


smoking and gender. 

As another example, the goal of the GDS1887 study was to build a prognosis model 
for rectal cancer cells responding to radio therapy. According to the GEO annotation, its 
46 samples had been separated into training and test groups for model construction and 
validation. Surprisingly, we found that the original training and test samples from this 
study could be well separated (average Wilcoxon p-value 0.001) by four highly correlated 
PPs (average correlation 0.95). All of those four PPs come from training datasets that 
compare the cancer to normal tissue or that compare cancers of different malignancies. 
This strongly suggest that there were systematic differences in cancer malignancy between 
the training and testing samples, even though they were supposed to be generated by 
random partition. Any such sampling bias would negatively impact the accuracy of the 
prognosis model. 

Of course, sampling bias can often be traced to the very limited availability of phe¬ 
notype data in the first place. Our compendium of predicted phenotype profiles 
(http://zhoulab.usc.edu/PhenoProfiler) provides a comprehensive description of a large 
proportion of the datasets in the GEO database. This knowledge can facilitate an un¬ 
biased understanding of the transcriptome-phenome mapping. It can also serve as the 
starting point for the identification of molecular mechanisms shared by different diseases 
and phenotypes. 
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2.3 Discussion 


Phenotypes are complex, and difficult to quantify in a high-throughput fashion. The lack 
of comprehensive phenotype data can prevent or distort genotype-phenotype mapping. 
This paper describes a novel approach to perform in silicon phenotype profiling. Our 
method provides numerous advantages which we outline here. (1) For most datasets we 
were able to predict multiple phenotype profiles, which could help researchers to reveal 
different aspects of complex diseases and facilitate treatment design. (2) We can provide 
a quantitative phenotype description of the sample characteristics. Although “categor¬ 
ical” phenotype description is prevalent, in reality phenotypes constitute a continuous 
spectrum. (3) Our method can extrapolate the profiling to classes beyond those repre¬ 
sented in the training data, as illustrated in the glioma case study. This is an advantage 
over traditional classification methods. (4) PhenoProfiler avoids direct comparison of 
gene expression values from different datasets, and thus can utilize almost all available 
microarray data regardless of platform or laboratory. In contrast, traditional regression 
methods cannot be directly applied to microarray datasets from different platforms. 

The continued accumulation of microarray data will further increase the power of 
PhenoProfiler in two aspects: the variety of phenotypes to be profiled, and the confidence 
of its predictions. The latter benefit derives from having several mutually correlated 
PPs from similar datasets. The principles of our method can be easily applied to other 
types of genomics data (e.g. proteomics or metabolomics) as they become increasingly 
available. The present work focuses on linear gene-phenotype associations, but more 
complex relationships can be devised depending on the data characteristics. 
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Our univariate method for selecting the weights from the training sample is only one 


of many possible approaches. For example, one could consider constructing weights using 
a multivariate procedure that takes into account correlations among the gene expression 
levels, such as Fisher’s linear discriminant procedure (i.e Discriminant Function Analysis 
for two groups). However, such an approach requires estimating a covariance matrix for 
the gene expressions which is not practical given that there are thousands of genes and 
a limited number of samples (typically on the order of 10) per dataset. In fact Fan and 
Fan [FF08] prove that, when the dimension of the feature space is high, a univariate 
two-sample t-test procedure, similar to our approach, is often superior to a multivariate 
method. Alternatively, when the important phenotype information can be characterized 
using a small number of linear combinations of the genes, dimension reduction techniques 
like Nonnegative Matrix Factorization [BTGM04,TSE+07] may also produce meaningful 
phenotype predictions. 


2.4 Methods 

2.4.1 Predicting phenotype profiles by constrained 
optimization 

From a training microarray dataset, we derive a vector w = (zci, rc 2 ,..., iCn)^ that con¬ 
tains the gene-phenotype association coefficients of n genes. Given a new dataset with n 
genes and m samples, and the normalized gene expression matrix E = {eij)nxm, we aim 
to obtain the optimal Phenotype Profile (PP) of the m samples, where PP is a normalized. 
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real-valued vector p = (pi,p 2 , • • • ,PmY' that show high similarity to the expression pro¬ 
files of those genes that have high magnitude of gene-phenotype association coefficients 
w (termed signature genes) 


Qi : min d{E, w, p) = \wi\y^(sgn{wi)eij - pjY 

pgjjrn ■> J 


subject to E Pi = 0 


m — 




Let b = ( 6 i, 62 , • • •) be the weighted sum of gene expression values for each 
sample, bj = 'YZi'^ieij- The following theorem provides the solution to the minimization 
problem. 


Theorem: The solution p to problem Qi is a vector that is the normalized form of 
b. That is, 

. b-b 

where b and cj(b) are the mean and standard deviation of b respectively. 

Proof: By expanding the function d, we have 


d(E, W, p) = ^ \wi\ el + E 1^*1 2 E E ®bTi 

i j i j i j 
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Since p is normalized and E and w are fixed, the first two terms are fixed. So the 


minimization problem Qi can be simplified to an equivalent maximization problem: 


Q 2 : max c(E,w,p) = w'^Ep 

peM™ 

subject to p^l = 0 

rp 

p p = m — 1 

where 1 is the vector whose elements are all 1 . 

Let b = E^w. Let the Lagrangian function for Q 2 be 

L(p, Ai, A 2 ) = b^p + Alp'll + A 2 (p'^p - (m - 1)) 

where Ai and A 2 are Lagrangian multipliers. According to the Karush-Kuhn-Tuker con¬ 
ditions [BSS93] (as the functions b^p, p^l, and p^p are all convex), the solution to 
Equation 2.1 contains the global optimum of Q 2 -, 


VL(p,Ai,A2) = 0 (2.1) 

Equation 2.1 results in two solutions: p = Since <52 is a maximization problem, 

it is easy to show that the solution of <52 is p = and so is the solution of Qi. 

p is regarded as the PP of the new dataset because among all vectors in p is the 
one that most resembles the normalized expression prohles of the signature genes that 
were dehned by the training data. 
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We calculate an association score c as the Pearson correlation between w and Ep. 
The score is derived from the maximization problem Q 2 . Ep provides the association 
between expression profiles and the predicted phenotype profile in the testing dataset. 
Thus, higher c indicates higher consistency of gene-phenotype associations derived from 
the training and testing datasets. 

2.4.2 Data collection and processing 

We collected 587 human microarray datasets, each containing at least 5 samples, from 
the NCBI Gene Expression Omnibus (GEO) [BTW^07]. For data generated with the 
Affymetrix platforms, we increased any values less than 10 to 10 and performed a log 
transform of the gene expression values. For genes with multiple probesets present, the 
expression values of those probesets were averaged. We then normalized each dataset by 
converting the expression values of each gene to Z-scores (zero mean and unit variance). 
The 587 datasets yielded 537 training datasets. A training dataset contains two disjoint 
sample groups, each of which contains at least 10 samples, representing a phenotype and 
its baseline. When performing PP prediction, we discard those training-testing dataset 
pairs that share fewer than 100 genes in common. 

2.4.3 Automatic processing of phenotype annotations 

In GEO, a dataset is usually annotated by a short description paragraph; a sample group 
is annotated by a word or a short phrase; and a sample is usually annotated by one 
sentence. To systematically categorize the phenotype information associated with each 
microarray dataset, we used the Unified Medical Language System (UMLS) [Bod,BK06]. 
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We mapped the dataset description, sample group descriptions, and sample descriptions 
onto UMLS concepts via the MetaMap Transfer program [AroOl]. To reduce noise we 
focused on disease-related concepts, including the MeSH vocabulary and the semantic 
types “Pathologic Function”, “Injury or Poisoning”, “Anatomical Abnormality”, “Body 
Part, Organ, or Organ Component”, “Tissue”, and “Cell”. In general, the higher a 
concept is on the UMLS hierarchy, the broader is the concept. Disease concepts at the 
fine granularity level may be associated with more clinical significance. In order to infer 
higher-order links between datasets, all of the ancestor concepts of mapped concepts were 
included. 


2.4.4 Measuring phenotype annotation similarity 

We measure phenotype similarity between sample groups (the two groups of a training 
dataset and the two groups of a testing dataset) by the following procedure. 1) For each 
group, we map its title and member sample descriptions onto UMLS concepts. 2) We 
then construct a term frequency-inverse document frequency (tf-idf) vector [SB88] for 
each sample group. 3) Suppose that Un and U 12 are two tf-idf vectors corresponding 
to the two sample groups in the training dataset, and U 21 and U 22 are tf-idf vectors 
corresponding to the two sample groups in the testing dataset. The similarity between the 
sample groups is then calculated as max(< XJii,TJ 2 i > -|- < Ui 2 ,U 22 >, < Uii, U 22 > 
+ < Ui 2 ,U 2 i >), where < a, b > denotes the cosine similarity [LKS'''07] of two vectors 
a and b, calculated as a normalized dot product < a, b >= Essentially, this 

measure identifies the best match between the sample groups in the training dataset and 
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testing dataset while taking into account the possibility that the groups could be matched 


in reverse order. 


2.4.5 Calculation of TF-IDF vectors of sample groups 

Note that a sample group is called “subset” in NCBI GEO. Annotation of a sample group 
includes UMLS concepts, mapped from both (1) its text title and (2) all its samples’ text 
description. In our experiment, each sample group is viewed as document and each UMLS 
concept is viewed as a term. Thus, given 587 human microarray datasets, we collected 
4065 documents (sample groups) and 1277 terms (UMLS concepts). 

In NCBI GEO annotation of sample groups, its title and each samples’ description 
share almost the same set of UMLS concepts. Our practical experiments showed that, in 
such complicated annotation environment, the frequency of a UMLS concepts in a sample 
group does not indicate the importance of the UMLS concept. So we used only 0 or 1 
to represent the term frequency of each UMLS concept. Then IDE of each term (UMLS 
concept) is counted based on the binary TE vector of each document (sample group). 
Details of how to get TE-IDE is presented in the following formulas: 


TE 


term, document — 


/Z)i^term — 

TFIDFterm, document — 


log 


1, if this term occurs in the document; 
0, otherwise. 

number of all documents 
number of documents this term occurs 


TE 


term, document 


X IDF, 


term 


( 2 . 2 ) 


(2.3) 

(2.4) 
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In practical experiments, we noticed that a large portion of UMLS concepts are shared 
by two sample groups in the same dataset (e.g., Un and U 12 from the training dataset). 
Ideally, we expect that annotation of two sample groups from the same dataset should 
reflect their phenotype differences, not commonness. Thus, we mask by zero in TF-IDF 
vectors, the common UMLS concepts shared by the same-dataset sample groups, e.g, Un 
and U 12 (or U 21 and U 22 from testing dataset), when computing annotation similairty. 
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Chapter 3 


A Stable Iterative Method for Refining Discriminative 
Gene Clusters 

3.1 Introduction 

Microarray has become an important tool for identifying genes that discriminate sample 
phenotype classes because of its power of monitoring the expression levels of thousands of 
genes in a single experiment. Finding discriminative genes with gene expression data is 
actually the feature selection problem in classification theory. From the machine learning 
point of view, it is critical since the gene expression datasets usually contain a small 
number of experiments (called samples) and a large number of genes (called features) 
in each experiment. The selected highly discriminative genes after hltering out those 
non-representative genes which may dilute the pattern in classification computation can 
be further studied for the investigation on the biological mechanisms that are responsible 
for sample phenotype class distinction. 

A number of efforts have been put in searching effective gene selection methods (For 
example [GST'''99,GWBV02,XFZ01,AM02,FSMJ03,ZLS'''06]). Due to the small-sample 
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size and high-dimension properties of the sample phenotype classification problem, it is 
not difficult to find a feature subset that can perfectly discriminate all the samples [MR03]. 
In fact, theoretical study in [Cov65] showed that even for the non-informative, randomly 
generated dataset, the expected size of a feature subset that can linearly discriminate all 
the n samples is just (n -|- l)/2. In microarray data analysis, there can be a large number 
of highly discriminative subsets containing only a couple of genes; and each individual 
gene in such a subset is not necessarily highly discriminative. For example, we observed 
by exhaustive search that there are as many as 10,173 perfect 3-gene subsets for sample 
phenotype classification with the weighted voting method proposed by Golub et al and 
with their proposed training-test split [GST+99]; and these gene subsets cover 3,337 
genes (93.4% of all the 3,571 genes in the datasets after preprocessing). This observation 
suggests that a method of finding a highly discriminative compact gene subset is not 
enough. The variability of the subsets found by such a method likely hinders the discovery 
of real interaction among the genes given that the method is usually sensitive to both the 
choice of samples and noise in the microarray data. 

The fundamental limit and challenges mentioned above motivates us to design more 
robust methods by taking into account the expression similarity information among genes. 
In this paper, we identify a series of discriminative gene clusters by running clustering 
and feature selection processes iteratively, where the centroids of the clusters are used to 
form predictors. This work also shows that the predictor constructed in this way is more 
stable and less sensitive to the choice of training samples. Because biological functions are 
usually resulted from collective behavior and coordinated expression of a group of genes 
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rather than that of an individual gene, genes grouped according to their co-expression 


pattern may be more powerful in revealing gene regulation mechanisms. 

Our approach to generate discriminative gene clusters is a combination of supervised 
and unsupervised technique In recent years, Jornsten and Yu [JY03] and Dettling and 
Buhlmann [DB02] proposed similar combination approaches. However, there are major 
differences between their methods and our method. We use a multivariate approach for 
cluster selection, while Dettling and Buhlmann [DB02] employed a univariate approach, 
which assumes the independence of the contribution of clusters to sample phenotype clas¬ 
sification. Although such hypothesis reduces computational complexity for large datasets, 
the accuracy is compromised since the complex biological interaction among gene clusters 
is not properly reflected. We exploit a multivariate approach in the content of gene expres¬ 
sion analysis since it accounts for the joint contribution of clusters to sample phenotype 
classification. It is known that such complex phenotype like cancer is resulted from not 
expression of individual genes but rather the underlying genetic networks. Thus a mul¬ 
tivariate approach that emphasize the combinatorial effect of variables is more suitable 
to the exploration of the interaction among variables. Our method differs from [JY03] 
in the following two aspects: Although both works adopt multivariate approach, first 
of all, in their information-based approach, clustering and cluster selection are done si¬ 
multaneously, resulting in a set of clusters optimizing the Minimum Description Length. 
In comparison, our computation-oriented approach is a refining process where cluster¬ 
ing and cluster selection are performed alternatively in each iteration step with better 
and better results. Secondly, the clusters generated with Jornsten and Yu’s approach 
include both active and inactive ones. Here, active clusters are those whose centroids are 
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relevant to sample phenotype classification, and inactive ones are not. Our method is 
essentially a backward approach [AM02]. It iteratively eliminates the less active clusters 
and re-clusters the remaining genes in the active clusters, reducing the negative influence 
of non-discriminative clusters on the sample phenotype classification. 

Our program outputs a series of cluster sets that have increasing discrimination power 
for training samples without losing prediction power on the test samples, as indicated in 
our experimental results. It achieved similar or better prediction accuracy than the 
known methods aforementioned for most of the tested datasets in our validation process. 
More importantly, our test shows that the centroids of the output clusters using different 
sets of training samples are stable and consistently achieve significant proximity to the 
global optimal gene clusters obtained by using all the samples. Another advantage of our 
method is that it provides researchers with flexibility to decide which cluster set should 
be chosen for their purpose. 


3.2 Results 

We implemented the algorithm (described in the Methods section) as MATLAB functions. 
It runs on a PC with the Windows operating system. The SVM program written by 
Cawley was downloaded from the website http://theoval.sys.uea.ac.uk/ gcc/svm/toolbox. 
In this section, we present the detailed test results on both simulated and Leukemia 
AML/ALL datasets [GST+99]. We also have tested our method on other real datasets 
and compared the performance of our algorithm with those reported in the previous 
literature. The details of the performance measures are described in the Method section. 
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3.2.1 Simulated datasets 


We generated 100 simulated binary classification datasets using a simple stochastic model. 
Each simulated dataset contains 100 samples evenly split into two sample phenotype 
classes. Both training and test samples contain 25 samples in each class. 

Each dataset contains of 400 genes evenly divided into four gene clusters. Two of the 
four clusters are relevant to sample phenotype classification and these two discriminative 
clusters Ci and C 2 contribute to sample phenotype classification independently. Their 
centroids x(C'i) and x(C' 2 ) are generated according to the sample phenotype class labels. 
Each component of x(C'i) ™ ^ position is generated according to normal distributions 
N(l, 0.5) or N(-l, 0.5) depending on whether the corresponding sample is in class 1 or 
class -1, while each component of generated according to N(-l, 0.5) if the sample 

is in class 1 and N(l, 0.5) otherwise. Similarly, the centroids of the non-discriminative 
clusters 6*3 and 6*4 are generated according to the normal distribution N(l, 1) and N(- 
1,1) regardless of the samples’ class. For each i=l, 2, 3, 4, the expression values of a 
gene in the cluster Ci are generated according to the multivariate normal distribution 
N{x{Ci),di/A), where di = minj^id{x{Ci),x{cj)). 

We run our algorithm with the input gene set contains all the 400 genes for each of 
the 100 simulated datasets. The performance results are summarized in Figure 3.1. We 
observed that the classification performance of the generated clusters keeps increasing as 
the iteration process goes. The average classification accuracy of these tests jumps from 
0.756 up to 0.848 (Figure 3.1a); and the classification accuracy 9test on training samples 
goes up from 0.720 to 0.984 (Figure 3.1b). 
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(a) (b) 




(c) (d) 

Figure 3.1 The performance analysis on simulated datasets. The solid lines indicate 
the average values; and dotted lines indicate one standard deviation from the averages. 
The X-axis represents the number of genes in Si. Note that, when the generating process 
goes, the number of genes in Si decreases, (a) The classification accuracy Otest on the 
test samples, (b) The classification accuracy Otrain on the training samples, (c) The 
percentage psim (i) of truly discriminative genes in Sp (d) The p-value psii) based on 
5{Ai). 
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We also observed that more and more truly discriminative genes are identified in the 


active clusters as the algorithm proceeds. Since the genes in the discriminative clusters are 
known in each simulated dataset, we computed the ratio Psimii) = of the truly 

discriminative genes over all the genes in for each iteration i. The active clusters output 
Psimii)-, just before the algorithm terminates is about 0.778 (Figure 3.1c). Recall that, 
at each iteration i, the algorithm generates k = 50 active gene clusters since the number 
of training samples = 50 for each simulated dataset. We found that at each iteration 
i, the centroids of two active clusters are very close to x(C'i) and x(C' 2 ), the centriods of 
the discriminative clusters in the model. This is reflected by the indistinguishably small 
p-value ps{i) calculated based on d{Ai,A'). Here d{Ai,A') is the ’average’ Euclidean 
distance of centroids between an active cluster in Ai and its closest cluster in A' = 
{Cl,C2}. 

In the same time, the centroids of active clusters become more and more distinguish¬ 
able from each other, increasingly close to the average pairwise distance of all 400 genes, 
and such trend can also be reflected by the increasing p-value ps{I) from 0.228 up to 
0.476 (Figure 3.Id), calculated based on 5{Ai), the average Euclidean distance between 
the centroids of active clusters in Ai. Meanwhile, the Silhouette width ijd{Ai) of active 
clusters in Ai increases from 0.826 to 0.980. 

3.2.2 Leukemia dataset 

Leukemia AML/ALL dataset [GST'’'99] contains the expression values of 6,817 human 
genes in 47 acute lymphoblastic leukemia (ALL) and 25 acute myeloid leukemia (AML) 
tissue samples. After performing the threshold filtering and logarithmic transformation 


39 



procedure, we obtained a reduced dataset with only 3,571 genes. Here, we validate our 


algorithm by using three-fold cross validation. In each run, we randomly selected two third 
of the samples as the training samples and the rest as the testing samples. The samples 
of different phenotype classes are kept proportional in the training and test samples. The 
resulting dataset was further normalized by rescaling the variance of expression values of 
each gene to 1 in the training samples, and then applying the same rescaling factor to 
the expression values of that gene in the test samples. 

We conducted the three-fold cross validation for 100 times. To reduce computational 
cost, we restrict our algorithm on small portions of discriminative genes. In each run, the 
algorithm starts with the input gene set consisting of the 357 genes (10% of all the 3,571 
genes) that are highly correlated with the training samples’ phenotype classification in 
terms of the correlation metric proposed in [GST+OO]. 

Figure 3.2 summarizes the values of the different performance indicators. The average 
classification accuracy 6 train on the training samples ranges from 0.994 up to 1 (Figure 
3.2b); and the average classification accuracy 9test on the test samples increase slightly 
from 0.966 to 0.972 (Figure 3.2a). These results show that the centroids of the clusters 
generated in different iteration steps discriminate the training samples better and better 
without significant decrease of its generalization ability. 

For the evaluation of our algorithm, we searched for perfect 3-gene subsets, which can 
be used to perfectly classify all 72 samples using the weighted voting classifier trained on 
all the samples. This search resulted in 9,722 perfect subsets. We selected 48 (roughly 
equal to rir) genes <7* (1 < i < 48) with highest occurrence frequency to form the cluster 
set = {{g'j}|l <i< 48} for comparison with the clusters generated by our algorithm. 
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Figure 3.2 The analysis of the three-fold cross validation performance of the algorithm 
on the Leukemia dataset. The dotted lines indicate the performance values in individual 
tests. The solid lines indicate the average values; and the dotted lines indicate one 
standard deviation from the averages. The X-axis represents the number of genes in S^. 
(a) The classification accuracy on the test samples, (b) The classihcation accuracy 
Strain on the training samples, (c) The p-values psii) based on d{Ai, A'l). (d) The p- 
values ps{i) based on d{Ai,A'2). (e) The p-value Paiiii) based on 6{Ai). (f) The average 
Silhouette width dj{Ai) of active clusters in Ai. 
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We also evaluate our algorithm using another cluster set A 2 , the hnal set of active 
clusters generated by our algorithm with S' as the input gene subset and with all the 
72 samples as the training samples, where S' is the set of the 357 genes (10% of all the 
3,571 genes) that are highly correlated with the AML/ALL sample classes in terms of the 
correlation metric proposed in [GST'''99]. 

Probably because of the selection sensitivity of the correlation metric of [GST'’'99] 
resulting from small sample size, the gene sets that are selected according to different 
training-test splits do not have many genes in common. In all the 100 validation exper¬ 
iments, only 120 genes appearing in every input gene set . This number is quite small 
compared with 1,071, the number of the genes appearing in some input gene sets (each of 
size 357). By contrast, the centroids of clusters in the set generated in each of iterations of 
our algorithm in different runs are significantly similar to the selected discriminative genes 
in A/ and A 2 at most iteration steps. This is reflected by the very small p-values Psii) 
computed based on d{Ai, A/) and d{Ai, A' 2 ), which range from 4.11 x 10“^ to 6.12 x 10“^ 
(Figure 3.2c) and from 5.62 x 10“^ to 2.38 x 10“^ (Figure 3.2d) respectively. The above 
observation strongly suggests the stability associated with discriminative clusters rather 
than with individual discriminative genes. Such stability is one of the main advantages 
of our method. 

We further studied the biological function of genes in the active clusters using Gene 
Ontology (GO), focusing on the biological processes located at the fifth level of the GO 
hierarchy. For the set of all genes from active clusters in A*, we hnd its enriched biolog¬ 
ical processes by calculating the hyper-geometric p-value, then convert the p-value into 
a log score s by s = — logiQ(p). Table 3.1 gives the top four biological processes that 
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Biological process 

Average score 

Inflammatory response 

2.72 

Regulation of catalytic activity 

2.19 

Response to wounding 

1.98 

Positive regulation of metabolic process 

1.95 


Table 3.1 Significantly enriched biological processes 

are most significantly enriched in the active clusters in the final iteration, in terms of 
the score averaged from the 100 validation experiments. All four processes are frequently 
associated with leukemia. In addition, we inspected the change of proportion of the genes 
of the four processes in the active clusters during refinement iterations. The proportions 
are also averaged over the 100 validation experiments. Figure 3.3 shows that when the 
active clusters contain less than two third genes in input gene set S, the average gene 
proportions of all four processes monotonically increase until the last iteration. Such 
convergence strongly suggests that our method can indeed rehne clusters into biologi¬ 
cally meaningful ones. Interestingly, processes of inflammatory response and response 
to wounding showed very similar convergence patterns. In fact, these two processes are 
closely related. The same holds for biological processes of regulation of catalytic activity 
and positive regulation of metabolic process. 

3.2.3 The performance analysis on other real datasets 

Besides the above dataset, we also tested our algorithm on seven other datasets. The 
descriptions of these datasets are as follows. Altogether, we derive 12 sample phenotype 
classification studies from the 8 datasets. 
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Average proportion 



Gene number 

Figure 3.3 Average proportion of genes of the four biological processes during the refine¬ 
ment iterations. The solid black, blue, red, and green lines corresponding to the ordered 
processes in Table 3.1 from top to bottom. 
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1. ALL T/B Cell dataset. The 47 ALL samples in Leukemia ALL/AML dataset 
in [GST'*“99] are further divided into 39 T-cell samples and 9 B-cell samples. 

2. Breast cancer dataset [WBD'’'01]. This dataset comprises 7,129 genes and 49 sam¬ 
ples being divided into two sample phenotype classes according to their estrogen 
receptor (ER) responses: 25 for ER positive and 24 for ER negative. 

3. Carcinoma dataset [NASLOl]. It contains the expression levels of about 6600 genes 
in 18 tumor and 18 normal tissues. 

4. Colon dataset [ABN+OO]. It consists of the expression levels of 6,500 human genes 
in 40 tumor and 22 normal tissues. 

5. Diffuse Large B-Cell Lymphoma (DLBCL) dataset [SRT'*'02]. It consists of expres¬ 
sion values of 6,817 genes in 58 DLBCL and 19 Eollicular Lymphoma. 

6. The Melanoma dataset [BMC'’'00]. The dataset consists of the expression ratios 
of 6,971 human genes in 12 Unclustered Cutaneous Melanomas and 19 Cutaneous 
Melanomas samples. 

7. Prostate dataset [SER+02]. The dataset consists of the expression levels of 52 
prostate and 50 normal samples of 6,744 human genes. 

8. The Small, round blue cell tumors (SRBCT) dataset [KWR+Ol]. It consists of 88 
samples divided into five sample phenotype classes: neuroblastoma (NB) (18 sam¬ 
ples), rhabdomyosarcoma (RMS) (25 samples), Burkitt lymphomas (BL) (11 sam¬ 
ples), the Ewing family of tumors (EWS) (29 samples) and others (5 samples). The 
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dataset has 2,308 genes. Since we consider the binary sample phenotype classifica¬ 


tion problem, we derived four binary sample phenotype classification datasets from 
this dataset using the one-against-all rule: SRBCT-NB, SRBCT-RMS, SRBCT-BL 
SRBCT-EWS. RFERENCES 


We preprocessed each dataset by applying filtering and logarithm transformation if 
necessary. Eor each sample phenotype classification study, we run our algorithm 100 
times by choosing random training-test splits in the same way as the Leukemia dataset 
described in the last subsection. The performance of our method is summarized in Table 
3.2. In the table, there are two columns for each performance measure, indicating the 
average values of the corresponding measures at the first and last iteration step of our 
algorithm. Because the exhaustive search of the most frequent globally optimal genes 
for constructing is time-consuming, we only compare the active clusters with A 2 
constructed as follows; 1) we apply our algorithm on all samples in the dataset and 2) 
use the active clusters of the last iteration as A^. 

The classification accuracy 6test on the test samples shows that among 9 of 12 sample 
phenotype classification studies, the prediction performance of active clusters in Aj in¬ 
creases slightly from the start to the end of each execution, which are highlighted in the 
table. The value of Otest for the remaining three studies (Breast, Colon and Carcinoma) 
decrease slightly. The above observations indicate that for all datasets we tested, there is 
no significant decrease in the generalization ability of the active clusters At in obtained 
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Datasets 

1 ^test 1 

1 p5(i) based on d(Ai, A 2 ) 

Paii{i) based on S{Ai) 

cj(Ai) 

Leukemia 
ALL T/B 
cell 


0.977 

1.72E-02 

8.28E-03 

0.088 

0.331 

0.406 

0.973 

Breast 

0.843 

0.842 

1.33E-02 

8.36E-03 

0.142 

0.421 

0.351 

0.974 

Carcinoma 

0.983 

0.981 

2.96E-02 

3.20E-02 

0.194 

0.252 

0.382 

0.966 

Colon 

0.814 

0.806 

2.43E-02 

2.06E-02 

0.75 

0.755 

0.673 

0.978 

JJLHCJL 

0.896 

0.929 

8.75E-02 

1.99E-02 

0.441 

0.514 

0.716 

0.982 

Melanoma 

0.913 

0.921 

1.71E-02 

2.25E-02 

0.129 

0.463 

0.272 

0.957 

Prostate 

0.889 

0.916 

4.79E-02 

2.27E-02 

0.495 

0.541 

0.68 

0.987 

yKBC'L- 

BL 

1 

1 

3.63E-04 

7.52E-05 

0.314 

0.322 

0.682 

0.984 

SHBC'i'- 

EWS 

0.956 

0.986 

5.06E-04 

9.17E-05 

0.297 

0.408 

0.634 

0.984 

SHBCi- 

NB 

0.989 

0.996 

2.99E-04 

6.42E-05 

0.321 

0.436 

0.665 

0.986 

yKBC'L- 

RMS 

0.974 

0.98 

4.82E-04 

8.18E-05 

0.304 

0.347 

0.63 

0.989 

Lukemia 
AML / 

ALL 

0.966 

0.972 

5.62E-03 

2.38E-03 

0.212 

0.398 

0.627 

0.98 


Table 3.2 The performance of the algorithm for different sample phenotype classification 
studies. 


in each iteration step. The classification performance Otrain on the training samples in¬ 
creases in all of the 12 studies, which indicates that the separation of the training samples 
improves for all studies. 

All the 100 input gene sets S vary a lot in different runs for each study. There are 
only 1.1% to 5.1% of all the genes appearing in all the 100 input gene sets S, while at 
least 23.8% to 51.7% genes appear in some input gene sets. By contrast, the centroids 
of clusters in SAi generated by our algorithm at each iteration step i are stably close 
to the optimal centroids of clusters in as reflected by the p-values Ps{i) ranging from 
2.99 X 10“^ to 8.75 x 10“^ at the first iteration step and those ranging from 6.42 x 10“^ to 
3.20 X 10 ^ in the last iteration step. The consistent closeness of the clusters generated 
in different repeats can also be reflected in the standard deviation of /Os(i), which are 
limited from 0.32 to 0.96 times of the absolute values of in the first iteration step and 
0.24 to 1.37 times at the last iteration step. 
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During the generation process, the p-values Paiiii) of average pairwise distance 5{Ai) 
among centroids of clusters in Ai keeps increasing for all 12 studies (ranging from 0.088 to 
0.750 at the first iteration step and from 0.252 to 0.755 in the last step), and the average 
Silhouette width of active clusters cd(Ai) keeps increasing for all the 12 studies (ranging 
from 0.230 to 0.698 at the first iteration step and from 0.964 to 0.989 in the last iteration 
step). This indicates that the clusters in Ai are more and more distinct in general. 

In summary, our test shows that on real microarray datasets, our algorithm is able 
to generate clusters that separate the training samples with increasing prediction accu¬ 
racy and closeness to known optimal clusters. Such discriminative cluster refinement is 
consistent with what we have observed on simulated datasets. 

3.2.4 Comparing the sample phenotype classification 
performance to other studies 

In this section, we compare the cross validation performance of our method with pre¬ 
vious works reported in [JY03, DB02, SRT+02, JSR02]. For the purpose of comparison, 
we converted the classification performance from the classification accuracy Otest into the 
error rate. Table 3.3 summarizes the comparison of our algorithm (of both binary and 
multi-sample phenotype class versions) with others by the cross validation error rates. It 
is difficult to make direct comparisons with other approaches in the literature, because 
the specific data sets or data preparation are not always available. However, the perfor¬ 
mances our method is in general comparable to others. In the comparison, the DLBCL 
and Carcinoma datasets are validated using leave-one-out validation; and the remaining 
datasets are validated using three-fold cross validation. 
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Datasets 

Our algorithm 

Dettling and 

Buhlmann 
(2002) 

Jornsten and 
Yu (2003) 

Shipp et al. 
(2002) 

Lukemia AML/ALL 

3.43 - 2.57 

6.58 - 2.71 



Leukemia three 

classes 

13.8 - 9.3 


12.6 


Breast 

16.14 - 14.11 

3.00 - 0.75 



Carcinoma 

5.6 - 0.0 




Colon 

19.41 - 18.23 

23.35 - 15.95 

MT6 


DLBCL 

8.7 - 7.4 



7.8 

Prostate 

11.09 - 8.36 

16.47 - 6.91 



SHBC'L multi class 

5.92 - 4.27 

5.76 - 0.43 




Table 3.3 Comparison of our algorithm with others. 


Dettling and Buhlmann [DB02] reported the error rate of their algorithm for different 
datasets. They employed nearest neighbors and aggregated trees as the classifiers in their 
three-fold cross validation test. For the leukemia AML / ALL dataset, our algorithm 
seems to achieve a slightly lower error rate than theirs. In the Colon and Prostate datasets, 
the error rate of our algorithm lies between that of theirs. For the Breast dataset, the 
error rate is significantly higher than that of Dettling and Buhlmann’s. However, we 
obtained the performance using all the original 49 samples. The error rate in each test 
ranges from 7.89 to 6.90. According to [WBD+01], at least 7 out of the 49 samples are 
inherently erroneous. In comparison, Dettling and Buhlmann [DB02] used the 38 good 
samples selected by [WBD+Ol], and the error rate ranges from 1.14 to 0.10. The 38 
samples used in Dettling and Buhlmann [DB02] consists none of the above 7 erroneous 
samples. Thus, we believe that the performances of ours and Dettling and Buhlmann’s 
are still comparable for the Breast dataset. 

For the DLBCL dataset, the leave-one-out performance of Shipp et al. [SRT+02] is 
in our performance range. For Carcinoma dataset, Jaeger et a. [JSR02] achieved perfect 
leave-one-out performance, and our best performance can match theirs. For the Colon 


49 




dataset, both ours and Dettling and Buhlmann’s error rate are higher than Jornsten and 


Yu’s. 

We also test the performance of the multiple-sample phenotype class version of our 
method against other methods. For the Leukemia three-class dataset, our method is 
comparable to Jornsten and Yu’s method. However, for the SRBCT multi class dataset, 
our algorithm seems had a slightly higher error rate than that of Dettling and Buhlmann’s. 


3.3 Conclusions 

Due to the small-sample-high-dimension nature of the microarray dataset, it is not diffi¬ 
cult to find highly discriminative gene subsets of small size. However, if a gene selection 
process is unstable with the choice of training samples, the biological significance of the 
resulting gene subsets is often not guaranteed. In this paper, instead of finding individual 
discriminative genes or gene subsets, we propose a novel backward approach for gener¬ 
ating a series of highly discriminative gene clusters. Compared to selection of individual 
discriminative genes, genes grouped in these clusters are more stable when subject to 
change of training samples. Therefore they could provide more convincing support to 
gene interactions that are associated with the sample phenotype classes. In future, we 
will work with biologists to study the biomedical implication of these clusters. 

Regarding to the sample phenotype classification performance, the gene clusters pro¬ 
duced by our approach can generally achieve good cross validation performance compared 
to the existing methods for most of datasets we tested. More importantly, our test exper¬ 
iments show that regardless of the choice of training samples, the centroids of the clusters 
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generated are stable and significantly close to the known optimal gene clusters found using 
all the samples. All these indicate that our approach is promising. However, the current 
version of our algorithm is time-consuming. In future, the computational efficiency will 
be investigated. On the other hand, we used K-means algorithm, a typical partitioning 
based clustering method to seek a certain number of clusters that minimize the sum of 
squared distances between each gene and its centroid. The drawback for K-means is the 
subjective specification of input parameters such as the number of clusters and initial 
centroid locations. For unknown microarray datasets, such information is unavailable. 
Furthermore, different input parameters may result in significantly different clustering 
results. K-means can only converge to local optima, rather than the global optimum. In 
order to address these problems associated with K-means clustering. We plan to apply 
a novel clustering method based on Random Matrix Theory (RMT) [ZW08] which is 
completely objective and do not require the specification of cluster number and initial 
centroid locations. RMT method avoids being trapped into local optima. Furthermore, 
most previous clustering methods including K-means and hierarchical clustering parti¬ 
tion members into non-overlapping groups. The RMT method allows the same genes in 
multiple groups to reflect the fact that a single gene may contribute to multiple biological 
pathways. 

In order to test the discriminative power of a certain gene cluster, additional criteria 
established by statistical analysis should also be conducted to identify and remove inactive 
cluster. For example, gene expression pattern observed in the active clusters should 
be less likely to appear in the control set. Chi Square test might be used to test the 
significance. Some data normalization technique may be considered in the preprocessing 
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step to improve the data quality. Furthermore, more suitable backward feature selection 


method needs to be exploited so that the gene clustering and cluster selection processes 
can be integrated better. Our approach provides a flexible framework that allows us to 
test the performance of various computing modules in a various ways of combinations. 

Our method can not only be easily extended to multi-sample phenotype class predic¬ 
tion, but also can be easily extended to integrate gene clusters obtained from multiple 
datasets that contain sample phenotype classification study of same type. Such integra¬ 
tion can be obtained by counting the occurrence frequency of a gene in similar clusters 
obtained from different datasets. In such way, by pulling evidences from multiple datasets 
from different experimental sources, genes in such clusters of high occurrence frequency 
would have higher confidence to form truly discriminative gene sets. 


3.4 Methods 

3.4.1 Algorithm 

In this subsection, we present our backward approach for generating discriminative gene 
clusters. The method is executed in a repetitive manner. In each pass, the method first 
groups genes into clusters that may indicate functional categories [THC'’'99]. It then 
ranks the generated clusters and eliminates those clusters that are less discriminative so 
that the re-clustering of remaining genes can generate modules with better resolution and 
stronger association with the sample phenotype classes. In the clustering stage, we use 
the K-means method to group the genes into a constant number of active clusters. 


52 



In the elimination stage, we use a backward feature selection method. This stage 


involves cluster validation and evaluation of the discriminative ability of active clusters. 
To validate clusters, we use the Silhouette width [KR90] to measure their validity. Assume 
the input genes are partitioned into p clusters Ci,C2, ■ ■ ■ ,Cp. Given a gene g, let Wg be 
the average Euclidian distance between g and another gene within the same cluster, and 
let bgj the average Euclidian distance between and a gene in a different cluster Cj. 
Then the Silhouette width a;„ of g is defined as lj„ = — ^9 g^j^d the Silhouette 
width of a cluster is defined as the average Silhouette width of all its members. It is easy 
to see that the Silhouette width of a cluster fall within the range from -1 to 1. A good 
cluster should have a high Silhouette width. 

To measure the discriminative ability of an active cluster, we adopt the idea of SVM- 
RFE method in [GWBV02]. Support Vector Machine (SVM) is a binary-class prediction 
method originated from statistical learning theory [VapOO]. A linear SVM first finds a 
decision hyperplane y = w^x -|- b that maximizes the separation between samples of two 
classes; and then it does class prediction according to the relative location of a new sample 
with respect to the hyperplane in the feature space. Note that the weight vector w found 
by the linear SVM indicates the relative importance of the genes for the classification. 
Here, we iteratively train a linear SVM and eliminate a gene cluster based on an overall 
evaluation on both the weight and the Silhouette width instead of discarding single gene in 
the original SVM-REE method. Such systematic approach makes the elimination process 
to better reflect the underlying biological meaning. 

Our method is summarized into the algorithm in Figure 3.4. In the algorithm, A 
denotes the set of inactive gene clusters; A* denotes the set of active clusters at each 
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iteration i; Si denotes the set of genes under consideration at the beginning of the iteration 


i; K denotes the number of clusters partitioned at each iteration step. For simplicity, we 
set K to be the number of training samples. 


Input: A gene expression dataset and a set 5 of the genes. 

1 ) A ^^ i ^— 0, Sq ^— S ^ 

2) Apply K-means to partition 5, into k (active) clusters, forming 4 i 

while do 

3) Calculate the Silhouette width for each active cluster in 4; 

4) Train a linear SVM using the centroids of active clusters in 4 i 

5) Find a least active cluster c, in 4 according to its Silhouette width and weight 
in the SVM, and insert it into A; 

6 ) ; /■ <- / + 1 

end do 

7) Output { 4 ,..., 4 }. 


Figure 3.4 The algorithm of selecting discriminative clusters. 


It is often difficult to determine how many clusters the genes should be grouped into 
for microarray datasets, which usually have complex expression patterns. The algorithm 
outputs K = Uj., the number of the training samples, active clusters in each iteration. 
This is because the expected size of a feature subset that can linearly discriminate all 
the samples is only (n^ + l)/2 [MR03]. Note that if the feature number is too small, the 
clustering will lose its resolution. 

Recall that the K-means clustering method starts with an initial partition of the genes. 
In order to make it more deterministic in Step 2, we first select k genes as follows: Find 
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a furthest gene pair and form an initial gene set G, and then iteratively find a gene with 
largest average Euclidean distance from the genes in G and add it into G until \G\ = k. 
We then partition all the genes into clusters by merging each gene with its nearest gene 
in G. 

The calcnlation of the Silhouette width of each cluster in Ai takes all the clusters in 
both sets Ai and into account A. At the ith iteration, the algorithm groups all the genes 
in the set Si into k clusters, forming the cluster set Ai, and then insert the least active 
cluster into the inactive cluster set A in Step 5 as follows. 

There are two important factors to evaluate in order to determine which cluster should 
be removed from Si and added into A. The first factor is the cluster’s Silhouette width. 
Another factor is the cluster’s discriminative ability in terms of its weight determined by 
the linear SVM constructed in Step 4. Here, we would like to eliminate a least discrim¬ 
inative cluster whose centroid is sufficiently representative of the expression pattern of 
the cluster (measured by the Silhouette width). In other words, we eliminate a set of well 
clustered genes whose expression patterns have little contribution to classification. On 
the other hand, those not well clustered genes will be re-clustered at later iterations. 

Since the above two factors are not always consistent, we adopt a multiple objective 
optimization technique appearing in [Gen] to find a nice tradeoff between these two factor 
and such multiple objective method is shown in Figure 3.5. 

Finally, we can extend the algorithm to the multiple-class case by adopting the popular 
one-against-all approach. In this approach, given a training test split, both training and 
test samples of a dataset of AT > 2 classes are transferred into K binary classification 
problems, each corresponding to classify samples from one class against samples from all 
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Given a set of clusters with the Silhouette width and the rescaled weight 

magnitude , where iv,-=2 |m',.|-1, w,. is the weight of the corresponding 

centroid in the SVM constructed in Step 4. We define the objective function 
/(/) = -ij.) + 5^ (>!>,- -vv,.) for each /, where f = argmin(H’, ) and /" = argmin(5] ) . 

i i 

The algorithm identifies the cluster with smallest f(i) as the least active cluster. 


Figure 3.5 Multiple objective optimization procedure for cluster elimination. 


remaining classes. Then our algorithm executed on the K problems results in K series 
of active cluster sets Aj^i,j = Then classifiers are constructed using K x k 

clusters from the K active cluster sets ..., Ax^ij^ by selecting ii,... Ak such that 

..., are roughly identical. Given the centroids of the above K x k clusters, a 
multi-class linear SVM is trained using training samples and tested on test samples. 


3.4.2 Performance measures 

We validate our method in terms of its classification performance and clustering perfor¬ 
mance. The classification performance is determined by the classification accuracy on 
training or testing samples. We use the SVM as the classifier to evaluate the generated 
gene clusters. Classification accuracy Otest on the test samples is dehned as the percentage 
of the correctly classified samples. However, we dehne classification accuracy Otrain on 
training samples as the average accuracy of the 10-fold cross validation on the training 
samples as suggested in [AM02] for less biased estimation of classification performance. 

The clustering quality is measured in terms of the density of the clusters, as well as 
the distinction between clusters and the closeness of the clusters to some reference clus¬ 
ters. They are measured respectively by (a) the average Silhouette width cd{Ai) of active 
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clusters in Ai produced in iteration i, (b) the average Euclidean distance 6{Ai) between 
the centroids of active clusters in Ai and (c) the ’average’ Euclidean distance d{Ai,A') 
of centroids between an active cluster in Ai and its closest cluster in a reference cluster 
set A' (the construction can be found in the Result Section). To be more precise, assume 
Ai = {Cl,..., Ck} and A' = {Di ,..., D^}- Eirst, from 1 to k, find recursively C( G Ai 
and D[ G A' such that d{x{C'i), x{D[)) = d{x{Ci),x{D'l))^ where 

xO denotes the centeroid of a cluster, d {^) the Euclidean distance between two vectors, 
A" = {C[,..., and A" = {D [,..., Then, the ’average’ Euclidean distance 

d{Ai,A') is defined as d{Ai,A') = ^ d(x(C(), x(T>()). 

We measure the statistical significance of average distances in both case (b) and 
(c) at each iteration i against the pairwise distances of all genes in the input gene set 
S in terms of the p-value psi})-, and against all the genes in the dataset in terms of 
the p-value pall{i)- In each case, the p-values are calculated according to the empirical 
distribution (null distribution) of the pairwise distance of genes randomly sampled in the 
whole dataset. 
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Chapter 4 


An integrative approach to characterize disease-specific 
pathways and their coordination 

4.1 Introduction 

The recent development of microarray technology has significantly facilitated the iden¬ 
tification of disease-related genes [SC03, CCS'''02, GST+99, LMA06]. However, many 
complex disease phenotypes are caused not by individual genes, but by the coordinated 
effect of many genes. Insight into the structure and coordination of disease-related path¬ 
ways is crucial to understanding the pathophysiology of complex diseases. However, it has 
proved difficult to infer pathways from microarray data by deriving modules of multiple 
related genes, rather than individual genes. The major challenges are; (1) Genes in¬ 
volved in a pathway may exhibit complex expression relationships beyond co-expression, 
which may be overlooked by standard microarray analysis methods such as clustering 
[ZKH^05]. (2) Pathways are dynamic and the current static annotation of pathways may 
not serve as a good template. In fact, pathways are manual dissections of the underlying 
dynamic gene regulatory network. Under different conditions, different segments of the 
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ensemble network will be activated, leading to condition-specific activation of pathways 
[YMH+07]. 

In this study, by integrating many microarray datasets we propose a novel method to 
simultaneously infer pathways and disease/phenotypic conditions under which the path¬ 
ways are activated. The identified pathways may comprise genes with complex expression 
relationships beyond co-expression. Due to the existence of a large amount of cancer 
microarray data, we used cancer as our case study. We collected a series of microarray 
datasets measuring different types of cancers, and a series of datasets measuring other cel¬ 
lular/physiological conditions. We first construct a differential co-expression network, in 
which each node represents a gene and each edge indicates a gene pair that is frequently 
co-expressed in cancer datasets but not in non-cancer datasets. We then dissect the 
networks into cancer-subtype specific network modules by considering (1) co-expression 
dynamics and (2) network topology. Figure 4.1a illustrates the conceptual pipeline of 
our method. 

To measure co-expression dynamics, we use second-order expression similarity, which 
we proposed previously [ZKH'*“05]. Briefly, if we define first-order expression similarity 
as the expression similarity of two genes from one dataset, then second-order similarity 
measures whether two gene pairs simultaneously exhibit either high or low expression 
similarity across multiple datasets. In general, high first-order similarity suggests the ex¬ 
istence of a functional link between two genes, and clustering based on the second-order 
similarity captures multiple functional links always activated and deactivated under sim¬ 
ilar conditions. Such functional links are likely to comprise a functional module. Inter¬ 
estingly, genes in a second-order cluster may not always have high first-order similarity 
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Figure 4.1 Overview of analysis procedure. (A) Flow chart of the analysis pipeline. (B) 
Schematic illustration of the concept of second-order similarity. It is obvious that the 
overall expression similarity between the two gene pairs (genes 1 and 2 versus genes 3 
and 4) is not significantly high, but their first-order expression correlation profiles exhibit 
high second-order similarity. (C) Schematic illustration of the dissection of differential 
co-expression networks into network modules based on the co-expression dynamics and 
network connectivity. In the heat map, every column corresponds to a dataset and every 
row corresponds to a gene pair. Red, black, green and grey corresponds to positive, 
low, negative and missing correlations, respectively. By hierarchical clustering, the gene 
pairs fall into two major second-order clusters. The 9 gene pairs in the green cluster 
comprise three connected network components, whereas the 6 gene pairs in the red cluster 
give rise to three connected components. Furthermore, by considering the second order 
cluster of a higher level of the hierarchy, which consists of both green and red clusters, 
the six networks are united to form two connected networks, reflecting the hierarchical 
modularity of cancer co-expression networks. 
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(see an example in Figure 4.1b); therefore, second-order analysis allows us to identify 
functional modules that are inaccessible to co-expression analysis. 

Given multiple gene pairs sharing high second-order similarity, we further divide them 
into network components based on their connectivity on the differential co-expression 
graph (see an example in Figure 4.1c). We observe that genes within a connected 
network component are more likely to participate in the same specific pathway than those 
between different components, which, in turn, are likely to be involved in different relevant 
pathways. This may reveal high-order cross-pathway coordination. In fact, hierarchical 
clustering of differentially co-expressed gene pairs based on their second-order similarity 
results in a hierarchical modularity in terms of relevance of functional links. We designed 
a linear scaling model to select modules by considering both module size (number of 
edges) and within-module second-order similarity. Then, given selected modules, we can 
further infer datasets (phenotypic conditions) in which a module is activated, i.e. in which 
genes in the module coordinate. 

Applying our methods to 32 cancer-related microarray datasets, and 23 non-cancer 
related datasets, we derived 162 second-order clusters consisting of 224 network modules, 
activated either in cancer or in specific cancer subtypes. In particular, we identified a 
breast cancer specific network module that involved in tumor suppression via platelet- 
derived growth factor (PDGF)-like signaling, more importantly, a hub gene PDGFRL 
that may play an important role in this tumor suppressor module. 
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4.2 Results 


4.2.1 Network properties of the cancer differential 
co-expression network 

We curated 32 human microarray datasets (1,764 expression profiles in total) measuring 
cancers of 12 tissues, and 23 datasets (1,158 expression profiles) not related to cancer 
(e.g. normal tissues, chronic granulomatous disease, Huntington’s disease, inflammatory 
response). We first identify gene pairs which consistently demonstrate higher correlation 
in cancer versus non-cancer datasets based on a robust correlation estimator, the normal¬ 
ized Percentage Bend correlation (for details see Methods). In following sections, if not 
specified, the term correlation will by default refer to the normalized Percentage Bend 
correlation. These criteria result in 6,035 gene pairs covering 1,967 genes. The 6,035 gene 
pairs, each representing a potential conditional functional link, can be represented as a 
differential co-expression network. In this network, each gene is represented as a node 
and each differential co-expression relationship is represented as an edge. 

It has been reported that co-expression networks follow a scale-free node degree distri¬ 
bution [JMRWK04]. We observed that the differential co-expression network also follows 
such a topology, where only a small number of nodes act as “highly connected hubs” (see 
node degree distribution in Figure 4.2). This indicates that most gene-gene co-expression 
relationships differing between cancer and other phenotypes are associated with only a 
few “hub” genes. Such hub genes exhibit a high degree of coordination with many other 
genes in neoplastic states, and are therefore likely to play important roles in carcino¬ 
genesis and cancer progression. In fact, most hub genes fall into two main functional 
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categories: 1) core processes of neoplastic states such as cell division and chromosome or¬ 
ganization; or 2) dynamic interactions between cancer cells and their microenvironment 
such as angiogenesis, immune response, and cell adhesion . For those hub genes with 
unknown functions, we can predict their cancer-related functions based on their neighbor 
genes. For example, the 16 out of the 33 interacting partners of the ADP-ribosylation 
factor-like 6 interacting protein (ARL6IP) are involved in cell division (hypergeometric 
test p-value 1.6 x 10“^^). Thus, ARL6IP is likely to be involved in cell proliferation, con¬ 
sistent with its initial characterization as an interaction partner of the Ras superfamily 
member ARL6 [PBG'*'00]. As another example, while microfibrillar-associated protein 2 
(MFAP2) has long been known to bind to various components of the elastic extracellular 
matrix [JSOPOl], it has not been clear whether it serves more than a mechanical func¬ 
tion. We found that 6 out of its 24 neighbor genes are involved in cell adhesion (p-value 
7.7 X 10“^). In fact, a recent study found that MFAP2 binds to a neighbor gene Notchl 
and activates it [MLH'''06]. 

In the differential networks, we selected the top 112 hub genes that have degrees 
> 22, such that they together account for 30% of the total degrees on the differential 
co-expression networks. Many of those genes fall into two main functional categories: 
1) core processes of neoplastic states such as cell division and chromosome organization 
(36 genes); or 2) dynamic interactions between cancer cells and their microenvironment 
such as angiogenesis, immune response, and cell adhesion (24 genes). Genes falling into 
category (1) include TUBE (participates in microtubule-based movement; node degree 
70), CKSIB (cell cycle; node degree 43), KPNA2 (regulation of DNA recombination; 
node degree 44), RRMl (DNA replication; node degree 41), MAD2L1 (cell cycle; node 
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Figure 4.2 Node degree distribution on the differential co-expression networks. 
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degree 49), and S0X2 (establishment/maintenance of chromatin architecture; node degree 
30). Genes in category (2) include MAP2K7 (response to proinflammatory cytokines; 
node degree 50), TYROBP (cellular defense response; node degree 37), CD4 (immune 
responses; node degree 45). They are all reported to play a role in cancer pathogenesis 
and progression. In addition, hub genes in the category (1) behave as hub genes across 
most of datasets, while those in the category (2) tend to be in hub genes only in solid 
tumor datasets. 


4.2.2 Identification of pathway modules specific to cancer or cancer 
subtypes 

The differential co-expression network provides a summary of co-expression links fre¬ 
quently active across all types of cancers. However, it does not provide clues as to 
which set of links tend to be simultaneously active and inactive under which types of 
cancer. That is, the edges of a differential co-expression network may not be active in 
the same subset of datasets. In fact, the largest connected component of the differential 
co-expression network contains 5944 edges, which comprises 98% of all the edges in the 
network. Thus, based on connectivity alone we cannot break the network into functionally 
coherent and cancer-subtype specihc modules. 

To dissect the networks, we integrate two types of information, the co-expression 
dynamics and the network connectivity, to extract cancer-subtype specihc network mod¬ 
ules. First, we employ the second-order clustering approach to utilize the co-expression 
dynamics information. This includes two steps; (i) for any two genes connected with 
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an edge in the differential co-expression network, we calculate the expression correla¬ 


tion in each of the 32 cancer microarray datasets and store it in a vector, termed the 
first-order expression correlation profile of the genes; (ii) we then perform hierarchical 
clustering of all the gene pairs based on the Euclidean distance between the first-order 
expression correlation profiles. Unlike commonly used clustering approaches, the unit of 
the second-order clustering is a gene pair instead of a gene, and the distance between 
units is computed based on the first-order expression correlation profiles instead of the 
original gene expression profiles, hence the term “second-order” clustering [ZKH+05]. 
Since each edge represents a frequently occurring co-expression relationship in multiple 
cancer datasets, it likely represents a functional link. If a cluster of gene pairs follows 
the same co-expression pattern across multiple cancer datasets, it represents a module of 
functional links being turned on or off simultaneously across different cancer phenotypes. 

Given a second-order cluster of gene pairs, we further identify connected network com¬ 
ponents among them. We suggest that a set of gene pairs is more likely to be functionally 
related if they form a connected component. We found that the disjoint network mod¬ 
ules of same second-order cluster generally fall into different functional categories. From 
the 162 second-order clusters, we measure the functional similarity of genes within each 
connect network, and between networks of the same cluster using number of shared Gene 
Ontology functions. A t-test of the distinction in functional similarities gives p-value 
1.6 X 10“® on our selected network modules shows that the genes tend to be significantly 
more functionally coherent within a network than between networks. Thus each network 
is more likely to represent a pathway. But in addition, these networks are in the same 
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second-order cluster, this indicates that these networks are activated together in certain 


phenotypes. 

Given a second-order hierarchical clustering tree, we traverse the tree bottom up to 
retrieve connected network components. In general, the size of a connected component 
(5, the number of edges) decreases with the second-order diameter (D), defined as the 
largest pairwise second-order distance. We found that S and D show a linear scaling 
relationship in a logarithm scale (Figure 4.3). We are especially interested in outliers - 
network components small in D but large in S, which represent tightly clustered network 
modules relative to their size. We define the modularity score A of a subnetwork using 
a linear scaling model A = alog2(5') — log 2 (D) — /?, where a and /? are estimated using 
linear fitting. With our data, we obtain a = 0.13 and P = 2.2. We select the top 60% 
of networks {S > 4) ranked by A scores, removing those networks having D < 0.34, 
and merging heavily overlapping networks. This procedure resulted in 162 second-order 
clusters comprising 224 connected network modules, with size ranging from 4 to 64 edges. 
175 (78%) modules are statistically significantly functionally homogenous based on the 
GeneOntology Biological Process annotation (hypergeometric test p-value < 0.01). The 
most predominant functional categories are cell cycle, cell division, cell proliferation, 
response to stress, immune response and cell adhesion consistent with known pathological 
mechanisms of cancer. 

One main feature of our approach is that it can simultaneously discover network 
modules and the types of cancer in which the modules are activated. Figure 4.4a shows 
a module that is activated in most of the cancer datasets. The genes of the module 
are mostly involved in cell division and genetic stability, representing a cell proliferation 
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Figure 4.3 The linear relationship between network size S and cluster diameter D. Each 
dot in the figure corresponds to a connected network of size > 4. The line is fitted using 
networks with size > 10. 
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signature, a key feature of cancer. Figure 4.4c shows a network module which tends 


to be activated in only solid tumors. The genes of the module are mostly involved in 
cell adhesion and organogenesis, which is specific to solid tumor versus blood cancers or 
neoplastic cell lines. In the next section, we will detail two network modules which are 
activated predominately in breast cancer data sets. 

4.2.3 Network modules in the breast cancer cluster 

Our analysis resulted in a second-order cluster containing two connected network modules 
that tend to be more active in all seven breast tumor datasets relative to the rest of the 
datasets. The average correlation of these modules in breast tumor and other cancer 
datasets are 0.49 and 0.23 respectively (the t-test of co-expressions between breast tumor 
and the rest of cancer datasets gives a p-value of 1.56 x 10“®^). 

4.2.3.1 A tumor suppressor network related to PDGF superfamily 
signaling 

The module in Figure 4.5a contains 52 genes. Most of them are extracellular or mem¬ 
brane proteins, and 23 genes have previously been found to be involved in breast cancer. 
Following are examples of genes in the breast tumor suppressor modules that known to 
related to breast cancer : connective tissue growth factor (CTGF) is over-expressed in 
breast cancer cell lines with increased metastatic activity [KSS'’“03]; Fibulin 1 (FBLNl) is 
implicated in immune response against breast cancer [PAF'’'03], and one of its splice vari¬ 
ants is over-expressed in breast [BMM^05]; the cysteine-rich secreted protein (SPARC) 
plays a crucial role in tumor development in breast cancer [WDJB'’'05]; GASl is also found 
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Figure 4.4 General cancer and solid tumor network modules. (A) A network module 
that tend to activate in most of cancer datasets, consisting 24 genes and 28 edges. Average 
correlation across all data sets is 0.42. Most of genes in the module are related to cell 
division and genetic stability.(B) Another network module that is activated in most of 
cancer datasets, consisting 9 genes and 9 edges. The module is located in the same second 
order cluster as the one in figure 2a. Its average correlation across all datasets is 0.39. 
Most of genes in the module are related to nucleobase, nucleoside, nucleotide and nucleic 
acid metabolism. (C) Left : a network that tends to be activated only in solid tumor 
datasets. Right, the co-expression heatmap of the edges across datasets. Six datasets 
are not shown in the heatmap due to lack of valid co-expression estimations. Average 
correlation in solid tumor datasets and other datasets are 0.61 and 0.17, respectively. 
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to be induced in apoptotic mammary gland cells [SBB'*'05]; Cysteine-rich angiogenic in¬ 
ducer 61 (CYR61) is involved in the proliferation, cell survival, and Taxol resistance of 
breast cancer [MVM'*“04]; finally, in the case of a hub gene LRPl (low density lipoprotein 
receptor-related protein 1), the T allele of the C766T polymorphism is associated with 
an increased risk of breast cancer development [BJZ^OS]. 

Among the 52 genes, 16 are involved in cell adhesion (p-value 7.4 x 10“^^ ), and 14 are 
involved cell-cell signalling (p-value 7.9 x 10“® ), suggesting a role in tumor invasiveness 
of the module [RLS+04]. 

General cancer and solid tumor network modules Most interestingly, we found one 
main function of this network module appears to be tumor suppression via the inhibi¬ 
tion of PDGF superfamily signaling. A hub gene with high degree of this module is the 
gene PDGF receptor-like (PDGFRL) (degree 11). While its precise biological function 
is not known, PDGFRL encodes a 375aa product with significant sequence similarity to 
the extracellular domain of PDGFR. Indeed, mutations in PDGFRL have been found 
in individual cancer samples [KSU^97, LOT'*'99, ALG'’“02, KLK'*'03]. PDGFRL is lo¬ 
cated in chromosome 8p22-8p21.3, where multiple studies have suggested the existence 
of a putative breast cancer tumor suppressor gene [YKL'''96, SWF+00, RASB'''03]. Re¬ 
cently, an in-depth study of the region using microcell-mediated chromosome transfer 
found that indeed PDGFRL expression is decreased in the majority of breast cancer 
cells [SKW^06]. Many genes in this network module have been found to be involved 
in PDGF superfamily signaling. For example, Gysteine-rich protein (SPARG) binds to 
PDGF-AB and PDGF-BB dimers, inhibiting the binding of these growth factors to their 
cell surface receptors [RLIA'’'92] and inhibiting PDGF-induced vascular smooth muscle 
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Figure 4.5 Coordinated breast tumor network modules. (A) A breast tumor network 
module that involved in PDGF signalling. Genes in round shape have promoter regions 
that are predicted to be bound by transcription factor ZNF148. (B) A breast tumor net¬ 
work module related to inflammatory response, located in the same second-order cluster 
as the one in Figure 3a. Genes in round shape have promoter regions that are predicted 
to be bound by transcription factor POU2F1. 
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proliferation [MFK’’'02]. Also, connective tissue growth factor (CTGF) has structural 
similarities with PDGF [KMK'*“05]. Recently, two new PDGF ligands were discovered 
that have a N-terminal complement subcomponent GIR/GIS, UEGf, BMPl (GUB) do¬ 
main [UK05]. Interestingly both GIR and GIS are members of this network module. 
Also, LRPl is a physiological modulator of the PDGF signaling pathway [TMAH05]. In 
total, we found direct Pubmed literature support for 19 out of the 52 member genes to 
be involved in (or related to) the PDGF-superfamily signalling. Furthermore, it is known 
that zinc finger protein (ZNF148) binds to the PDGFR [DBSS'*“05] gene promoter. 
We screened the promoter regions of the 52 member genes, and found that the binding 
sites of ZNF148 are significantly enriched (hypergeometric p-value 0.016). In addition, 
it has been reported that PDGFR positively regulates collagen production (for exam¬ 
ple [GRH+07, ZBG+OO]). Our module showed the breast tumor specific co-expression 
between PDGFRL and collagens GOL3A1, GOL5A2 and GOL6A3. Many of the above 
evidences support the hypothesis that PDGFRL has an agonistic function to PDGFR 
signaling. 

Although abundant evidence suggests the individual involvement of many of these 
genes in tumor suppression, their coordinated function has never been elucidated. Here 
by adopting a network perspective, we have identified their interrelationships and a gene 
(PDGFRL) that may play a central role in this tumor suppressor network. 
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4.2.3.2 A network module related to inflammatory response 


Another identified breast cancer-specific network module (Figure 4.5b) may be involved 
in the coordination of the inflammatory response to cancer pathology. This network con¬ 
sists of 5 genes: tumor necrosis factor receptor superfamily member IB (TNFRSFIB); 
vascular cell adhesion molecule 1 (VCAMl); leukocyte-associated immunoglobulin-like re¬ 
ceptor 1 (LAIRl); and Cathepsin LI (CSTL). They are arranged around the macrophage- 
associated antigen (CD163). Several lines of evidence have implicated these genes indi¬ 
vidually in breast cancer, for example; increased plasma levels of VCAMl is associated 
with advanced breast cancer [SGC'’'06]; genetic variation in TNFRSFIB may predict the 
late onset of breast carcinoma, and relapse and death for patients with breast carcinoma 
[MBBAC05]; finally, the breast cancer cell line exhibiting the highest in vitro invasiveness 
also expressed the highest amount of CTSLl splice variant L-A3 [CKSL06]. 

Most of the genes of this module are related to tumor necrosis factor (TNF), an 
inflammatory cytokine. It has been reported that activation of rat CD 163 on peritoneal 
macrophages induces the production of pro-inflammatory mediators including TNF 
[PFD+06]; TNF directly interacts with TNFRSFIB [MERS96] and is a mediator of TNF 
function in the mouse ovary [GRP'’“07]. Finally, transcription of VCAMl in endothelial 
cells can be induced by TNF [NRT+95]. 

A transcription factor, octamer-binding transcription factor-1 (POU2F1, also known 
as Oct-1) has been predicted to bind promoter region of VCAMl, LAIRl and CTSL 
(hypergeometric p-value 0.012). The binding of POU2F1 to VCAMl promoter is indeed 
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supported by the literature [SRJ+98,SOG+03,OHB+02]. Also, POU2F1 has been found 
to bind SP3 [LLC'''03], which is reported to activate the transcription of CSTL [SR04]. 

The tight and coordinated expression of the genes in this network module reveals 
an induced inflammatory response that may be important in breast tumor progression 
[Cou02]. 

4.2.4 Identification of pathway coordination beyond co-expression 

A major advantage of second-order clustering is that it can identify functionally related 
genes beyond co-expression, as illustrated in Figure 4.1b. We elaborate on this point in 
this section. To allow readers to easily assess the magnitude of the correlation, only in 
this section we use the Pearson correlation to measure the co-expression level. 

Based on the definition of second-order clustering, connected network components 
within the same second-order cluster show coordinated activities, which implies their 
functional relevance. In the example of the two modules in the general cancer cluster in 
Figure 4.4a and 2b, each module may play different roles in the regulation of specific 
biological processes - cell division and nucleic acid metabolism, respectively; the latter 
is clearly required for cell division. Given the fact that these processes belong to the 
same second-order cluster, they may represent facets of the same underlying neoplastic 
process. However, member genes of the two modules exhibit distinct expression patterns: 
The average Pearson correlation between genes of the two modules is only 0.13. 

As another example, the two breast cancer modules described in last section are also 
related. Indeed, the collaboration of PDGF signalling and TNF have long been known to 
be required for tissue repairing [OMF04], and their abnormal expression play important 
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but partially defined roles in breast tumor development and progression [CMI'*‘06]. On 
the other hand, member genes of the two modules exhibit relatively distinct expression 
patterns. For example, two hub genes of modules, PDGFRL and CD163, also show very 
weak expression similarities across all seven breast cancer datasets: the average Pearson 
correlation between these two genes is only 0.24. 

Besides the above examples, we found coordinated modules within the majority of 
identified second-order clusters. Overall, from the total 162 second-order clusters, 25% 
give rise to more than one connected network module. To estimate the amount of cross 
module co-expression within second-order clusters, for each second order cluster, we first 
determined the active cancer datasets, in which the average Pearson correlation of gene 
pairs in the cluster is greater than 0.5. For 72% of those module pairs within the same 
second-order cluster, the average gene pairwise Pearson correlation between modules in 
the active datasets is less than 0.5 (normalized Percentage Bend correlation approximately 
< 0.35), and for 30% module pairs the cross-module average gene pairwise Pearson cor¬ 
relation is less than 0.3 (normalized Percentage Bend correlation approximately < 0.19). 

Furthermore, even genes in the same network module are not necessarily highly co¬ 
expressed when the module is active, despite their high degree of functional homogeneity, 
as discussed previously. We found that in 32 of 224 (14 %) modules, the average pairwise 
Pearson correlation of any two genes in the module is < 0.5 in the corresponding active 
datasets. Such modules could therefore easily be overlooked by traditional clustering 
methods. 
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4.3 Discussion 


The rapid accumulation of microarray data provides unprecedented opportunities to study 
the molecular mechanisms underlying disease pathogenesis and progression. Although 
many studies utilized multiple microarray datasets to derive consistent lists of genes 
specific for (subtypes of) cancer [RBR+02,RYS'''04,KCYCGK+04], little attention has 
been paid to derive genetic networks characterizing different types of cancer. Segal et al. 
[SFKR04] used predefined biologically meaningful gene sets including known biological 
pathways, and have successfully identified activated or repressed biological modules in a 
wide variety of neoplastic conditions. The approach, however, relies on the knowledge of 
pre-defined biological modules and has limited use in the discovery of novel association 
between genes. A recent study by Choi et al. [CYYK05] compared the two co-expression 
networks summarized from 10 tumor and normal datasets, respectively, and have identi¬ 
fied functional differences between normal growth and cancer in terms of gene coexpres¬ 
sion changes in broad areas of physiology. However, due to the multifaceted nature of 
cancer, interactions in such a derived summary network may not be simultaneously active 
in individual datasets, i.e. specific cancer conditions. 

In this study, we propose an unsupervised method that integrates both co-expression 
dynamics and network topology information to characterize cancer (subtype) specific 
network modules. The identified modules, such as modules activated across all can¬ 
cer subtypes or only in solid tumors, are novel, but consistent with known molecular 
mechanisms. Importantly, we have discovered a potential tumor suppressor network par¬ 
ticularly active in breast tumors, and provide compelling evidence that the hub gene 
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PDGFRL is a true tumor suppressor gene. Compared to commonly used differential or 
co-expression analysis, our approach has the following advantages; (1) our unsupervised 
approach simultaneously discovers network modules and the conditions (e.g. cancer sub- 
types) in which they are activated, thus providing new insights into the complex cellular 
mechanisms that characterize cancer and cancer subtypes. (2) Compared to existing 
approaches [CYYK05, LCL+06, CBCB04, SKFW03, BTS'''00] which can only identify 
densely connected network modules based solely on network topology information, our 
approach incorporating co-expression dynamics information (second-order similarity) can 
extract more diverse types of modules regardless of network density. It is known that 
many biological pathways do not necessarily form densely connected modules. (3) Our 
approach can reveal coordination of pathways beyond co-expression. (4) Our method can 
be applied to any types of molecular networks beyond co-expression network, when data 
of multiple networks under different conditions are available. 

In the current framework, the selection of biologically meaningful modules still need 
certain amount of manual intervention from biology expert. We are looking for more 
systematic ways for module selection, especially by putting the framework into the context 
of network statistics to improve the robustness. For example, although the scaling model 
we constructed is based on direct observation of the distribution of network properties 
S and D, their log-linear relationship suggests that it should be straightforward to make 
use of the exponential random graph models that have been used in recent years to 
study statistical aspects of networks [CSW05]. In essence, such models linearly combine 
network properties and assign the probability of observed networks as the exponential of 
such linear combinations. Integrating this work with the exponential family probabilistic 
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models may provide both better estimation of the coefficients in the linear scaling model 


and more accurate selection of network modules via hypothesis testing. We intend to 
explore this direction in future work. 

The choice of datasets depends on the research question. Ideally there should be a 
balanced and sufficient sampling of different phenotypes (in particular different tissues 
for this cancer study). Particularly, a paired Wilcoxon test between cancer and normal 
samples of the same tissue would significantly eliminate the amount of tissue-specific co¬ 
expressions. However, due to the limited amount and the heterogeneity of existing data 
it is currently impractical to achieve this goal. Using a weighted sampling scheme could 
potentially bypass the imbalance effects. We aim to investigate this strategy, and use 
statistical models of the correlation values to determine the weighting factors. 


4.4 Methods 

4.4.1 Datasets 

We curated 32 cancer and 23 non-cancer human gene expression datasets mainly from the 
Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) databases, 
each containing more than 15 microarrays, on either Affymetrix or cDNA platforms. In 
each dataset, if there are multiple probes that correspond to the same gene, we choose the 
one that contains the least amount of missing values. For datasets containing absolute 
expression measurements, we convert all values < 10 to 10, then perform a base 2 log 
transform. Estimation of Pairwise gene co-expression We used Percentage Bend Correla¬ 
tion [Wil05] (with j3 = 0.05) to obtain a robust correlation estimate. Percentage Bend 
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Correlation first detects outliers in expression values of each gene then reduces the effects 
of those outliers in the correlation calculation. Only gene pairs with a large number of 
valid samples m > 15 are used to calculate correlation. To make the correlation estimates 
comparable across different datasets of variable sample sizes and among different gene 
pairs of different amount of missing expression measures within the same dataset, we 
performed Fisher’s z-transform [And] to reduce sample size effect. Given a correlation 
estimate r and sample size n, the Fisher’s Z scores (divided by its theoretical standard 
deviation) is calculated as z = log (), which theoretically has an asymptotically 
standard normal distribution. Note that sample size n may be different from gene pair to 
gene pair due to missing values, and from dataset to dataset. In reality, we observed that 
the distributions of the z-score are still different from dataset to dataset: we therefore 
normalized z-scores to enforce the standard normal distribution. After that, standardized 
correlations r' are obtained by inverting the z-score with a fixed n of 30. 

4.4.2 Select differentially co-expressed gene pairs 

We define a gene pair to be differentially co-expressed between cancer and non-cancer if 
it satisfies the following two criteria; (1) their expression correlation in cancer datasets 
is sufficiently strong (can be either positively or negatively high). This is done by 
setting threshold for average summed square of correlations in cancer datasets, i.e, 

gene pair i, where there are c valid correlation estima¬ 
tions (c = 32 if there are no missing values) and k is dataset index corresponding to all 
valid correlation estimations; and (2) the Wilcoxon ranksum test of correlations between 
cancer and non-cancer datasets gives a p-value < 0.01. 
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4.4.3 Identify conditionally activated network module candidates 

We hierarchically clustered the differentially co-expressed gene pairs based on their ex¬ 
pression correlation prohles using the CLUSTER program [dHINM04] with complete 
linkage and Euclidean distance. The Euclidean distance is averaged to provide a simple 
estimation given the existence of missing correlations (due to the missing value problem). 
dij = where and are the correlations of gene pairs i and j in 

the dataset k, respectively, and c is number of valid correlations. In the hierarchical tree, 
each leaf node represents a gene pair, and each inner node corresponds to a second-order 
cluster of gene pairs (edges) which may comprise zero, one, or more connected network 
components. In cases where the size of the differential network is too big to be processed 
using hierarchical clustering (HC), the gene pairs were first separated using k-means clus¬ 
tering then processed the smaller clusters separately by hierarchical clustering. As for our 
experience, the biologically meaningful modules normally contain less than a few hundred 
edges, thus k-means clustering will keep most of modules intact. 

4.4.4 Gene ontology function and transcription factor enrichment of 
modules 

The functional enrichment analysis is done by the hypergeometric test on genes. We 
selected 419 Gene Ontology (GO) functions (i.e. biological process terms) which are 
4 levels below the root in the GO hierarchy. Each gene may be directly or indirectly 
associated with some of these functions. A set of genes will be considered to have a 
enriched function when (1) the functional homogeneity modeled by the hypergeometric 
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distribution [WHD’*'02] is significant at a significance level 0.01 and (2) there are at least 
2 genes in the set are associated with the function. 

4.4.5 Identification of transcription factor binding 

lOkb upstream sequences for each gene were obtained from NCBI Gene database. Af¬ 
ter applying RepeatMasker [SHG04], we used the MATCH program of TRANSFAC 
[MFG'’'03] (version 9.2) to scan the sequences for the presence of transcription factor 
binding sites based on position weight matrices. We used vertebrate-specific matrices, 
and chose cut-offs to minimize the sum of false positives and false negatives. We kept 
only the top 3,000 hits per matrix, sorted by the matrix similarity score. Altogether we 
obtained 349,178 predicted transcription factor target relationships of 180 transcription 
factors. A hypergeometric test was performed for each network module to search for 
over-represented transcription factor binding sites. 
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